PROCEEDINGS OF THE SAWTOOTH SOFTWARE CONFERENCE
September 2001

Copyright 2002. All rights reserved. This electronic document may be copied or printed for personal use only. Copies or reprints may not be sold without permission in writing from Sawtooth Software, Inc.

FOREWORD

The ninth Sawtooth Software Conference, held in Victoria, BC on September 12-14, 2001, will be forever remembered due to the tragic events of the prior day, September 11, 2001. It is ironic that we had moved the date of the conference to September 12th to avoid an earlier scheduled WTO meeting in Victoria, for fear that protests might somehow disrupt transportation to the conference. Instead, in response to the terrorist attacks on the World Trade Center, the Pentagon, and another hijacking that ended in a fatal crash in Pennsylvania, the FAA grounded all commercial flights in North America. About half our attendees had already made it to Victoria, but many were unable to leave home, and still others were stranded in airports around the world.

We were impressed by the professionalism of those who spent those difficult days with us in Victoria. On breaks between sessions, we were riveted to the television set in the registration room, collectively shaking our heads. Marooned together, we went ahead with a successful conference. Technology came to the rescue for some speakers unable to make it, as we piped their voices into the ballroom via phone line and advanced the PowerPoint slides on the screen as they spoke. A few planned speakers were unable to deliver a presentation at all. Due to the circumstances, no best paper award was given this year. Many were deserving, as you will discover.

The papers presented in this volume are in the words of the authors, and we have performed very little copy editing. We wish to express our sincere thanks to the authors and discussants whose dedication and efforts made this very unusual 2001 Conference a success. Some of the papers presented at this and previous conferences are available in electronic form in the Technical Papers Library on our home page: http://www.sawtoothsoftware.com.

Sawtooth Software
February, 2002

CONTENTS

KNOWLEDGE AS OUR DISCIPLINE
Chuck Chakrapani, Ph.D., Standard Research Systems / McMaster University, Toronto, Canada

PARADATA: A TOOL FOR QUALITY IN INTERNET INTERVIEWING
Ray Poynter, The Future Place, and Deb Duncan, Millward Brown IntelliQuest

WEB INTERVIEWING: WHERE ARE WE IN 2001?
Craig V. King and Patrick Delana, POPULUS

USING CONJOINT ANALYSIS IN ARMY RECRUITING
Todd M. Henry, United States Military Academy, and Claudia G. Beach, United States Army Recruiting Command

DEFENDING DOMINANT SHARE: USING MARKET SEGMENTATION AND CUSTOMER RETENTION MODELING TO MAINTAIN MARKET LEADERSHIP
Michael G. Mulhern, Ph.D., Mulhern Consulting

ACA/CVA IN JAPAN: AN EXPLORATION OF THE DATA IN A CULTURAL FRAMEWORK
Brent Soo Hoo, Gartner/Griggs-Anderson, Nakaba Matsushima and Kiyoshi Fukai, Nikkei Research

A METHODOLOGICAL STUDY TO COMPARE ACA WEB AND ACA WINDOWS INTERVIEWING
Aaron Hill and Gary Baker, Sawtooth Software, Inc., and Tom Pilon, TomPilon.com

INCREASING THE VALUE OF CHOICE-BASED CONJOINT WITH "BUILD YOUR OWN" CONFIGURATION QUESTIONS
David G. Bakken, Ph.D., and Len Bayer, Harris Interactive

APPLIED PRICING RESEARCH
Jay L. Weiner, Ph.D., Ipsos North America

RELIABILITY AND COMPARABILITY OF CHOICE-BASED MEASURES: ONLINE AND PAPER-AND-PENCIL METHODS OF ADMINISTRATION
Thomas W. Miller, A.C. Nielsen Center, School of Business, University of Wisconsin-Madison, David Rake, Reliant Energy, Takashi Sumimoto, Harris Interactive, and Peggy S. Hollman, General Mills

TRADE-OFF STUDY SAMPLE SIZE: HOW LOW CAN WE GO?
Dick McCullough, MACRO Consulting, Inc.

THE EFFECTS OF DISAGGREGATION WITH PARTIAL PROFILE CHOICE EXPERIMENTS
Jon Pinnell and Lisa Fridley, MarketVision Research

ONE SIZE FITS ALL OR CUSTOM TAILORED: WHICH HB FITS BETTER?
Keith Sentis and Lihua Li, Pathfinder Strategies

MODELING CONSTANT SUM DEPENDENT VARIABLES WITH MULTINOMIAL LOGIT: A COMPARISON OF FOUR METHODS
Keith Chrzan, ZS Associates, and Sharon Alberg, Maritz Research

DEPENDENT CHOICE MODELING OF TV VIEWING BEHAVIOR
Maarten Schellekens, McKinsey & Company / Intomart BV

ALTERNATIVE SPECIFICATIONS TO ACCOUNT FOR THE "NO-CHOICE" ALTERNATIVE IN CONJOINT CHOICE EXPERIMENTS
Rinus Haaijer, MuConsult, Michel Wedel, University of Groningen and Michigan, and Wagner Kamakura, Duke University

HISTORY OF ACA
Richard M. Johnson, Sawtooth Software, Inc.

A HISTORY OF CHOICE-BASED CONJOINT
Joel Huber, Duke University

RECOMMENDATIONS FOR VALIDATION OF CHOICE MODELS
Terry Elrod, University of Alberta

SUMMARY OF FINDINGS

We distilled some of the key points and findings from each presentation below.

Knowledge as Our Discipline (Chuck Chakrapani): Chuck observed that many market researchers have become simply "order takers" rather than having real influence within organizations. He claims that very early on, marketing research made the mistake of defining its role too narrowly. Broadening its scope of influence includes helping managers ask the right questions and becoming more knowledgeable about the businesses market researchers are consulting for. As evidence of the poor state of marketing research, Chuck showed how many management and marketing texts virtually ignore the marketing research function as important to the business process.
Chuck argued that, as opposed to other sciences, market researchers have not mutually developed a core set of knowledge about the law-like relationships within their discipline. The reasons include a lack of immediate rewards for compiling such knowledge and an over-concern for confidentiality within organizations. Chuck decried "black-box" approaches to market research. The details of "black-box" approaches are confidential, and therefore the validity of such approaches cannot be truly challenged or tested. He argued that the widespread practice of "Sonking" (Scientification of Non-Knowledge), in the form of sophisticated-looking statistical models devoid of substantial empirical content, has obscured true fact-finding and ultimately lessened market researchers' value and influence.

Paradata: A Tool for Quality in Internet Interviewing (Ray Poynter and Deb Duncan): Ray and Deb showed how Paradata (information about the process) can be used for fine-tuning and improving on-line research. Time to complete the interview, the number of abandoned interviews at each question in the survey, and internal fit statistics are all examples of Paradata. The authors reported that complex "grid" style questions, constant-sum questions, and open-ended questions that required respondents to type a certain number of characters resulted in many more drop-outs within on-line surveys. In addition to observing the respondent's answers, the authors pointed out that much information can be learned by "asking" the respondent's browser questions. Ray and Deb called this "invisible" data. Examples include current screen resolution, browser version, operating system, and whether Java is enabled. Finally, the authors suggested that researchers pay close attention to privacy issues: posting privacy policies on their sites and faithfully abiding by those guidelines.

Web Interviewing: Where Are We in 2001? (Craig King and Patrick Delana): Craig and Patrick reported their experiences with Web interviewing (over 230,000 interviews over the last two years). Most of their research has involved employee interviews at large companies, for which they report about a 50% response rate. The authors have also been involved in more traditional market research studies, for which they often find Web response rates of about 20% after successful qualification by a phone screener, but less than 5% for "targeted" lists of IT professionals. They suggested that the best way to improve response rates is to give cash to each respondent, though they noted that this is more expensive to process than cash drawings. The authors reported findings from other research suggesting that paper-and-pencil and Web interviews usually produce quite similar findings. They reported on a split-sample study they conducted which demonstrated virtually no difference between the response patterns to a 47-question battery of satisfaction questions for Web and conventional mail surveys. The authors also shared some nuts-and-bolts advice, such as being careful about asking respondents to type alpha-numeric passwords containing easily confused characters, such as 0, o, and O; vv and W; or the number "1" and a lowercase "L".

Using Conjoint Analysis in Army Recruiting (Todd Henry and Claudia Beach): The Army has found it increasingly difficult to recruit 17- to 22-year-olds. Three reasons are a low unemployment rate, a decrease in the propensity among youth to serve, and an increase in the number of young people attending college.
As a result, the Army has had to offer increased incentives to entice people to enlist. The authors described how they used the results of a CBC study of potential enlistees to allocate enlistment incentives across different military career paths and enlistment periods. They utilized CBC utilities within a goal program aimed at allocating resources to minimize deviations from the recruiting quotas for each occupational specialty. They found that their model did a good job of estimating recruits at the shorter terms of service, but over-estimated recruit preference for longer terms of service. They discussed some of the challenges of using the conjoint data to predict enlistment rates. One in particular is that not all career paths are available to every enlistee, as they are in the CBC interview. Enlistees must meet certain requirements to be accepted into many of the specialties.

Defending Dominant Share: Using Market Segmentation and Customer Retention Modeling to Maintain Market Leadership (Mike Mulhern): Mike provided a case study demonstrating how segmentation followed by customer retention modeling could help a firm maintain market leadership. One of Mike's most important points was the choice of intention to re-purchase, rather than customer satisfaction, as the dependent variable. He argued that intention to re-purchase was better linked to behavior than satisfaction. Mike described the process he used to build the retention models for each segment. After selecting logistic regression as his primary modeling tool, Mike discussed how he evaluated and improved the models. In this research, improvements to the models were made by managing multicollinearity with factor analysis, recoding the dependent variable to ensure variation, testing the independent variables for construct validity, and employing regression diagnostics. The diagnostic measures improved the model by identifying outliers and cases that had excessive influence. Examples from the research were used to illustrate how these diagnostic measures helped improve model quality.

ACA/CVA in Japan: An Exploration of the Data in a Cultural Framework (Brent Soo Hoo, Nakaba Matsushima, and Kiyoshi Fukai): Brent and his co-authors cautioned researchers to pay attention to cultural differences prior to using conjoint analysis across countries. As one example, they pointed out some characteristics of Japanese respondents that are unique to that country and might affect conjoint results. For example, the Japanese reluctance to be outspoken (using the center rather than the extreme points of scales) might result in lower quality conjoint data. They tested the hypothesis that Japanese respondents tended to use the center part of the 9-point graded comparison scale in ACA and CVA. They found at least some evidence for this behavior, but did not find proof that the resulting ACA utilities were less valid than in countries whose respondents tend to use more of the full breadth of the scale.

A Methodological Study to Compare ACA Web and ACA Windows Interviewing (Aaron Hill, Gary Baker, and Tom Pilon): Aaron and his co-authors undertook a pilot research study among 120 college students to test whether the results of two new software systems (ACA for Windows and ACA for Web) were equivalent. They configured the two computerized interviews to look nearly identical (fonts, colors, and scales) for the self-explicated priors section and the pairs section.
The ACA for Windows interview provided greater flexibility in the design of its calibration concept questions, so a new slider scale was tested. The authors found no substantial differences between the utilities from the two approaches, suggesting that researchers can employ mixed-modality studies with ACA (Web/Windows) and simply combine the results. Respondents were equally comfortable with either survey method and took equal time to complete them. The authors suggested that respondents more comfortable completing Web surveys could be given a Web-based interview, whereas others might be sent a disk in the mail, be invited to a central site, or be visited by an interviewer carrying a laptop. As seen in many other studies, HB improved the results over traditional ACA utility estimation. Other tentative findings were as follows: self-explicated utilities alone did quite well in predicting individuals' choices in holdout tasks, but the addition of pairs and HB estimation further improved the predictability of the utilities; the calibration concept questions can be skipped if the researcher uses HB and does not need to run purchase likelihood simulations; and the slider scale for calibration concepts may result in more reliable purchase likelihood scaling among respondents comfortable with using the mouse.

Increasing the Value of Choice-Based Conjoint with "Build Your Own" Configuration Questions (David Bakken and Len Bayer): David and Len showed how computerized questionnaires can include a "Build Your Own" (BYO) product question. In the BYO question, respondents configure the product that they are most likely to buy by choosing a level from each attribute. Each level is associated with an incremental price, and the total price is recalculated each time a new feature is selected. Even though clients tend to like BYO questions a great deal, David and Len suggest that the actual data from the BYO task may be of limited value. The authors presented the results of a study that compared traditional Choice-Based Conjoint results to BYO questions. They found only a loose relationship between the information from the two methods. They concluded that a BYO question may serve a good purpose for product categories in which buyers truly purchase the product in a BYO fashion, but that larger sample sizes than for traditional conjoint are needed. Furthermore, experimental treatments (e.g., variations in price for each feature) might be needed either within or between subjects to improve the value of the BYO task. Between-subjects designs would increase sample size demands. David and Len pointed out that the BYO question focuses respondents on trading off each feature versus price rather than trading features off against one another. The single trade-off versus price may reflect a different cognitive process than the multi-attribute trade-off that characterizes a choice experiment.
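To make the running-price mechanics of a BYO question concrete, here is a minimal sketch of the logic such a screen needs; it is not the authors' implementation, and the attributes, levels, and incremental prices are invented for illustration.

```python
# Minimal sketch of the running-total logic behind a "Build Your Own" (BYO)
# question. Attributes, levels, and incremental prices are hypothetical.
BASE_PRICE = 199.0

INCREMENTAL_PRICE = {
    "processor": {"standard": 0.0, "fast": 75.0, "fastest": 150.0},
    "warranty":  {"1 year": 0.0, "3 years": 40.0},
    "support":   {"web only": 0.0, "phone": 25.0},
}

def byo_total(selections: dict) -> float:
    """Return the price implied by one level chosen per attribute."""
    return BASE_PRICE + sum(
        INCREMENTAL_PRICE[attr][level] for attr, level in selections.items()
    )

# Re-price the configuration each time the respondent changes a feature.
choice = {"processor": "fast", "warranty": "1 year", "support": "phone"}
print(byo_total(choice))            # 299.0
choice["warranty"] = "3 years"      # respondent upgrades a feature
print(byo_total(choice))            # 339.0
```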
Applied Pricing Research (Jay Weiner): Jay reviewed the common approaches to pricing research: willingness-to-pay questions, monadic designs, the van Westendorp technique, conjoint analysis, and discrete choice. Jay argued that most products exhibit a range of inelasticity, and finding that range is one of the main goals of pricing research. Demand may fall, but total revenue can increase over those limited ranges. Jay compared the results of monadic concept tests and the van Westendorp technique. He concluded that the van Westendorp technique did a reasonable job of predicting actual trial for a number of FMCG categories. Even though he did not present data on the subject, he suggested that the competitive context CBC offers may improve its results relative to other pricing methods.

Reliability and Comparability of Choice-Based Measures: Online and Paper-and-Pencil Methods of Administration (Tom Miller, David Rake, Takashi Sumimoto, and Peggy Hollman): Tom and his co-authors presented evidence that the usage of on-line surveys is expected to grow significantly in the near future. They also pointed out that some studies, particularly those comparing web interviewing with telephone research, show that different methods of interviewing respondents may yield different results. These differences may be partly due to social desirability issues, since telephone respondents are communicating with a human rather than a computer. Tom and his co-authors reported on a carefully designed split-sample study that compared the reliability of online and paper-and-pencil discrete choice analysis. Student respondents from the University of Wisconsin were divided into eight design cells. Respondents completed both paper-and-pencil and CBC tasks, in different orders. The CBC interview employed a fixed design in which respondents saw each task twice, permitting a test-retest condition for each task. The authors found no significant differences between paper-and-pencil administration and online CBC. Tom and his colleagues concluded that for populations in which respondents were comfortable with on-line technology, either method should produce equivalent results.

Trade-Off Study Sample Size: How Low Can We Go? (Dick McCullough): In market research, the decision regarding sample size is often one of the thorniest. Clients have a certain budget and often a sample size in mind based on past experience. Different conjoint analysis methods provide varying degrees of precision given a certain sample size. Dick compared the stability of conjoint information as one reduces the sample size. He compared Adaptive Conjoint Analysis (ACA), traditional ratings-based conjoint (CVA), and Choice-Based Conjoint (CBC). Both traditional and hierarchical Bayes analyses were tested. Dick used actual data sets with quite large sample sizes (N > 400). He randomly chose subsets of the sample for analysis and compared the results each time to the full sample. The criteria for fit were how well the utilities from the sub-sample matched the utilities for the entire sample, and how well market simulations for hypothetical market scenarios for the sub-sample matched the entire sample. Because the data sets were not specifically designed for this research, Dick faced challenges in drawing firm conclusions regarding the differences in conjoint approaches and sample size. Despite the limitations, Dick's research suggests that ACA data are more stable than CBC data (given the same sample size). His findings also suggest that conjoint researchers may be able to significantly reduce sample sizes without great losses in information. Especially for preliminary exploratory research, sample sizes as small as 30 or even less may yield valid insights into the population of interest.
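The kind of stability check described above can be sketched as follows. This is only a schematic illustration under assumed inputs (a matrix of individual-level part-worth utilities), not Dick's actual analysis: draw random sub-samples of decreasing size and compare each sub-sample's mean utilities with the full-sample means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical individual-level part-worth utilities: respondents x parameters.
utilities = rng.normal(size=(450, 12))
full_means = utilities.mean(axis=0)

def subsample_fit(n_sub: int, n_draws: int = 200) -> float:
    """Average correlation between sub-sample and full-sample mean utilities."""
    corrs = []
    for _ in range(n_draws):
        idx = rng.choice(len(utilities), size=n_sub, replace=False)
        sub_means = utilities[idx].mean(axis=0)
        corrs.append(np.corrcoef(sub_means, full_means)[0, 1])
    return float(np.mean(corrs))

for n in (300, 150, 75, 30):
    print(f"n = {n:3d}  mean r with full sample = {subsample_fit(n):.3f}")
```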
In the discussion following the presentation, Greg Allenby of Ohio State (considered the foremost expert in applying HB to marketing research problems) suggested that HB should work better than traditional estimation even with extremely small samples, even sample sizes of fewer than 10 people.

The Effects of Disaggregation with Partial-Profile Choice Experiments (Jon Pinnell and Lisa Fridley): Jon and Lisa's research picked up where Jon's previous Sawtooth Software Conference paper (from 2000) had left off. In the 2000 conference, Jon examined six commercial CBC data sets and found that Hierarchical Bayes (HB) estimation almost universally improved the accuracy of individual-level predictions for holdout choice tasks relative to aggregate main-effects logit. The one exception was a partial-profile choice experiment, in which respondents saw only a subset of the total number of attributes within each choice task. Jon and Lisa decided this year to focus the investigation on just partial-profile choice data sets to see if that finding would generalize. After studying nine commercial partial-profile data sets, Jon found that for four of the data sets simple aggregate logit utilities fit individual holdout choices better than individual estimates under HB. Jon could not conclusively determine which factors caused this to happen, but he surmised that the following may hurt HB's performance with partial-profile CBC data sets: 1) low heterogeneity among respondents, and 2) a large number of parameters to be estimated relative to the amount of information available at the individual level. Specifically related to point 2, Jon noted that experiments with few choice concepts per task performed less well for HB than experiments with more concepts per task. Later discussion by Keith Sentis suggested that the inability to obtain good estimates at the individual level may be exacerbated as the ratio of attributes present per task to total attributes in the design becomes smaller. Jon also suggested that the larger scale parameter previously reported for partial-profile data sets relative to full-profile data might in part be due to overfitting, rather than a true reduction in noise for the partial-profile data.

One-Size-Fits-All or Custom Tailored: Which HB Fits Better? (Keith Sentis and Lihua Li): Keith began his presentation by expressing a concern he has had over the last few years about Sawtooth Software's HB software and its assumption of a single multivariate normal distribution to describe the population. Keith and Lihua wondered whether that assumption negatively affected the estimated utilities if segments existed with quite different utilities. The authors studied seven actual CBC data sets, systematically excluding some of the tasks to serve as holdouts for internal validation. They estimated the utilities in four ways: 1) by using the entire sample within the same HB estimation routine, 2) by segmenting respondents according to industry sectors and estimating HB utilities within each segment, 3) by segmenting respondents using a K-means clustering procedure on HB utilities and then re-estimating within each segment using HB, and 4) by segmenting respondents using Latent Class and then estimating HB utilities within each segment.
Keith and Lihua found that whether one ran HB on the entire sample, or whether one segmented prior to estimating utilities, the upper-level assumption of normality in HB did not decrease the fit of the estimated utilities to the holdouts. It seemed unnecessary to segment before running HB. In his discussion of Keith's paper, Rich Johnson suggested that Keith's research supports the notion that clean segmentation may not be present in most data sets. Subsequent discussion highlighted that there seemed to be enough data at the individual level (each respondent usually received about 14 to 20 choice tasks) that respondents' utilities could be fit reasonably well to their own data while being only moderately tempered by the assumption of a multivariate normal population distribution. Greg Allenby (considered the foremost expert on applying HB to marketing problems) chimed in that Keith's findings were not a surprise to him. He has found that extending HB to accommodate multiple distributions leads to only minimal gains in predictive accuracy.

Modeling Constant Sum Dependent Variables with Multinomial Logit: A Comparison of Four Methods (Keith Chrzan and Sharon Alberg): Keith and Sharon used aggregate multinomial logit to analyze three constant sum CBC data sets under different coding procedures. In typical CBC data sets, respondents choose just one favored concept from a set of concepts. With constant sum (allocation) data, respondents allocate, say, 10 points among the alternatives to express their relative preferences/probabilities of choice. The first approach the authors tested was to simply convert the allocations to a discrete choice (winner takes all for the best alternative). A second approach coded the 10-point allocation as if it were 10 independent discrete choice events: each alternative receiving points was treated as the chosen concept in a separate copy of the task, weighted by its allocation. Keith noted that this was the method used by Sawtooth Software's HB-Sum software. A third approach involved making the allocation task look like a series of interrelated choice sets, the first showing that the alternative with the most "points" was preferred to all others, the second showing that the second most preferred alternative was preferred to the remaining concepts (not including the "first choice"), and so on. The last approach was the same as the third, but with a weight given to each task equal to the allocation for the chosen concept. Using the Swait-Louviere test for equivalence of parameters and scale, Keith and Sharon found that the different models were equivalent in their parameters for all three data sets, but not equivalent in scale for one of the data sets. Keith noted that the difference in scale could indeed affect the results of choice simulations. He suggested that for logit simulations this difference was of little concern, since the researcher would likely adjust for scale to best fit holdouts anyway. Keith concluded that it was comforting that the different methods provided quite similar results, and he recommended the coding strategy used with HB-Sum, as it did not discard information and seemed the easiest for his group to program whether using SAS, SPSS, LIMDEP or LOGIT.
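The HB-Sum style coding Keith recommended can be sketched in a few lines. This is an illustrative reading of the description above, not Sawtooth Software's actual implementation: each alternative that receives points becomes the "chosen" concept in its own weighted copy of the task.

```python
def expand_allocation(task_id, allocation):
    """Expand a constant-sum (chip) allocation into weighted discrete choices.

    allocation maps alternative -> points (e.g., 10 points in total).
    Each alternative with a positive allocation becomes the 'winner' of a
    separate copy of the task, weighted by the points it received.
    """
    expanded = []
    for alt, points in allocation.items():
        if points > 0:
            expanded.append({"task": task_id, "chosen": alt, "weight": points})
    return expanded

# A respondent spreads 10 points over three concepts in one task.
print(expand_allocation("task_1", {"A": 6, "B": 3, "C": 1, "None": 0}))
# -> three weighted choice records: A (weight 6), B (weight 3), C (weight 1)
```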
Dependent Choice Modeling of TV Viewing Behavior (Maarten Schellekens): Maarten described the modeling challenges involved with studying TV viewing behavior. Given a programming grid with competing networks offering different programming selections, respondents indicate which programs they would watch in each time slot, or whether they would not watch at all. As opposed to traditional CBC modeling, where it is assumed that a respondent's choice within a given task is independent of selections made in previous tasks, with TV viewing there are obvious interdependencies. For example, respondents who have chosen to watch a news program in an earlier time slot may be less likely to choose another news program later on. Maarten discussed two ways to model these dependencies. One method is to treat the choice of programming in each time slot as a separate choice task, and to treat the availability of competing alternatives within the same time slot as context effects. Another approach is to treat the "path" through the time slots and channels as a single choice. The number of choice alternatives per task dramatically increases with this method, but Maarten argued that the results may in some cases be superior.

Alternative Specifications to Account for the "No-Choice" Alternative in Conjoint Choice Experiments (Rinus Haaijer, Michel Wedel, and Wagner Kamakura): Rinus and his co-authors addressed the pros and cons of including a No-Choice (None, or constant) alternative in choice tasks. The advantages of the None alternative are that it makes the choice situation more realistic, it might be used as a proxy for market penetration, and it promotes a common scaling of utilities across choice tasks. The disadvantages are that it provides an "escape" option for respondents to use when a choice seems difficult, less information is provided by a None choice than by a choice of another alternative, and potential IIA violations may result when modeling the None. Rinus provided evidence that some strategies that have been reported in the literature for coding the None within Choice-Based Conjoint can lead to biased parameters and poor model fit, particularly if some attributes are linearly coded. He found that the None alternative should be explicitly accounted for as a separate dummy code (or as one of the coded alternatives of an attribute) rather than just be left as the "zero-state" of all columns. The coding strategy that Rinus and his co-authors validated is the same as has been used within Sawtooth Software's CBC software for nearly a decade.

History of ACA (Rich Johnson): Rich described how in the early '70s he developed a system of pair-wise trade-off matrices for estimating utilities for industry-driven problems having well over 20 attributes. Rich was unaware of the work of Paul Green and colleagues regarding full-profile conjoint analysis, which he noted would have been of immense help. He noted that practitioners like himself had much less interaction with academics than they do today. Rich discovered that trade-off matrices worked fairly well, but they were difficult for respondents to complete reliably. About that same time, small computers were being developed, and Rich recognized that these might be used to administer trade-off questions. He also figured that if respondents were asked to provide initial rank-orders within attributes, many of the trade-offs could be assumed rather than explicitly asked. The same information could be obtained for main-effects estimation with many fewer questions.
These developments marked the beginning of what became the computerized conjoint method Adaptive Conjoint Analysis (ACA). Rich founded Sawtooth Software in the early '80s, and the first commercial ACA system was released in 1985. ACA has benefited over the years from interactions with users and academics. Most recently, hierarchical Bayes methods have improved ACA's utility estimation over the previous OLS standard. Rich suggested that ACA users may be falsely content with their current OLS results and should use HB estimation whenever possible.

A History of Choice-Based Conjoint (Joel Huber): Joel described the emergence of Choice-Based Conjoint (discrete choice) methods and the challenges that researchers have faced in modeling choice-based data. He pointed to early work in the '70s by McFadden, which laid the groundwork for multinomial logit. A later paper by Louviere and Woodworth (1983) kicked off Choice-Based Conjoint within the marketing profession. Joel discussed the Red-Bus/Blue-Bus problem and how HB helps analysts avoid the pitfalls of IIA. Propelled by the recent boost offered by individual-level HB estimation, Joel predicted that Choice-Based Conjoint would eventually overtake Adaptive Conjoint Analysis as the most widely used conjoint-related method.

Recommendations for Validation of Choice Models (Terry Elrod): Terry criticized two common practices for validating conjoint models and proposed a remedy for each. First, he criticized using hit rates to identify the better of several models because they discard too much information. Hit rate calculations consider only which choice was predicted as most likely by a model and ignore the predicted probability of that choice. He pointed out that the likelihood criterion, which is used to estimate models, is easily calculated for holdout choices. He prefers this measure because it uses all available information to determine which model is best. It is also more valid than hit rates because it penalizes models for inaccurate predictions of aggregate shares. Second, Terry warned against the common practice of using the same respondents for utility estimation and validation. He showed that this practice artificially favors utility estimation techniques that over-fit respondent heterogeneity. For example, it understates the true superiority of hierarchical Bayes estimation (which attenuates respondent heterogeneity) relative to individual-level estimation. He suggested a four-fold holdout procedure as a proper and practical alternative. This approach involves estimating a model four times, each time using a different one-fourth of the respondents as holdouts and the other three-fourths for estimation. A model's validation score is simply the product of the four holdout likelihoods.
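The two validation measures Terry contrasts, and the four-fold splitting of respondents, are easy to state in code. The sketch below is an illustration of the procedure as summarized here, not Terry's own code, and it assumes that predicted choice probabilities for the holdout tasks are already available from whichever model is being validated.

```python
import numpy as np

def hit_rate(pred_probs, chosen):
    """Share of holdout tasks where the highest-probability alternative was chosen."""
    return float(np.mean(np.argmax(pred_probs, axis=1) == chosen))

def holdout_log_likelihood(pred_probs, chosen):
    """Sum of log predicted probabilities of the alternatives actually chosen."""
    return float(np.sum(np.log(pred_probs[np.arange(len(chosen)), chosen])))

def four_fold_splits(respondent_ids, seed=0):
    """Yield (estimation_ids, holdout_ids) pairs for the four-fold procedure."""
    rng = np.random.default_rng(seed)
    ids = np.array(respondent_ids)
    rng.shuffle(ids)
    folds = np.array_split(ids, 4)
    for k in range(4):
        holdout = folds[k]
        estimation = np.concatenate([folds[j] for j in range(4) if j != k])
        yield estimation, holdout

# The model's validation score is the product of the four holdout likelihoods
# (equivalently, the sum of the four holdout log-likelihoods).
```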
KNOWLEDGE AS OUR DISCIPLINE

Chuck Chakrapani, Ph.D.
Standard Research Systems / McMaster University, Toronto, Canada

Traditionally, marketing research has been considered a discipline that uses scientific methods to collect, analyze, and interpret data relevant to the marketing of goods and services. The acceptance of this definition has prevented marketing researchers from being meaningful partners in the decision-making process. The way marketing research has been positioned and practiced over the years appears to be at odds with the new information age and management decision requirements. There seems to be an immediate need to redefine our discipline and our role in management decision making.

In 1961, the American Marketing Association defined marketing research as "the systematic gathering, recording, and analyzing of data about problems relating to the marketing of goods and services". Implied in this definition is the idea that marketing researchers have no direct involvement in the process of marketing decision making. Their role is to provide support to the real decision makers by providing the information asked for by them. Academics readily accepted the AMA definition and its implications, as evidenced by the following typical quote from a textbook:

Marketing research is the systematic process of purchasing relevant information for marketing decision making (Cox and Evans 1972, p. 22; emphasis added).

Authors such as Kinnear and Taylor (1979) went a step further and explicitly made the point that only the decision maker had a clear perspective with regard to information requirements:

Only the manager has a clear perspective as to the character and specificity of the information needed to reduce the uncertainty surrounding the decision situation (Kinnear and Taylor 1979, p. 25; emphasis added).

By inference, marketing researchers have nothing to contribute to the character and specificity of the information needed to reduce the uncertainty surrounding the decision situation; only the manager has a clear perspective on these matters. This idealized version of the decision maker as someone who has a clear perspective on what he or she needs to make sound decisions is as much a myth as the concept of the "rational man" of the economic disciplines of yesteryears. Both these romanticized portraits (the decision maker with a clear perspective and the rational man who optimizes his well-being and returns) sound enticing and plausible in theory but are seldom obtained in practice. Yet experienced marketing researchers know that they have a lot to contribute to the character and specificity of the information needed to reduce the uncertainty surrounding the decision. In fact, this is one of the most important bases on which a good researcher is distinguished from a mediocre one.

By defining marketing research in terms of its narrow functional roles rather than by its broad overall goals, we have acutely limited the growth of marketing research as a serious discipline striving to create a core body of knowledge. To define ourselves by our functional roles rather than by our overall goals is similar to a university defining itself as a collection of buildings with employees whose job it is to publish papers and lecture students who pay. This narrow functional definition has led to tunnel vision in our profession. Its consequences have been far-reaching, and not in positive ways either. At the dawn of the twenty-first century, marketing research stands at the threshold of irrelevance, as the following facts indicate:

• In 1997, the Financial Times of London published a 678-page book, The Complete MBA Companion, with the assistance of three major international business schools: IMD, Wharton, and the London Business School. The book had 20 modules that covered a wide range of subjects relevant to management. Marketing research is not one of them. The term "marketing research" is not even mentioned in the index.
• In 1999, the Financial Times published another book, Mastering Marketing Management, this time with the assistance of four major international business schools: Kellogg, INSEAD, Wharton, and the London Business School. The book had 10 modules and covered the entire field of marketing. Again, marketing research is not one of them. No module discussed the topic of marketing research directly; rather, there were some indirect and sporadic references to the uses of marketing research in marketing. Apparently, the field of marketing can be mastered without having even a passing familiarity with marketing research.

• The following books, many of them business bestsellers, completely ignore marketing research: Peters and Waterman's In Search of Excellence and A Passion for Excellence, Porter's Competitive Advantage, Rapp and Collins's Maximarketing, and Bergleman and Collins's Inside Corporate Innovation (Gibson 2000).

There is more. The advent of new technologies (the Internet, data mining, and the like) brought with it a host of other specialists who started encroaching upon the territory traditionally held by marketing researchers, minimizing their importance even further. All this reduced marketing researchers to the role of order takers.

Yet, beneath the surface, things have been changing for a while. In 1987, the AMA revised its definition of marketing research and stated that: "Marketing research is the use of scientific methods to identify and define marketing opportunities and problems; generate, refine, and evaluate marketing actions; monitor marketing performance; and improve our understanding of marketing as a process (Marketing News 1987, p. 1)."

This extended definition acknowledged that information is used to identify and define marketing opportunities and problems; generate, refine, and evaluate marketing actions; monitor marketing performance; and improve understanding of marketing as a process. Marketing research is the function that links the consumer, customer, and public to the marketer through information. Thirteen years before the dawn of the third millennium, the AMA acknowledged that marketing research is much more than collecting and analyzing data at the behest of the "decision makers"; it is a discipline in its own right and is involved in improving our understanding of marketing as a process. With this new definition, marketing research is not a content-free discipline that merely concerns itself (using methods heavily borrowed from other disciplines) with eliciting information, with no thought given to accumulating, codifying, or generalizing the information so elicited, but a discipline with content that is relevant to marketing as a process.

As one of their prescriptions for reviving the role of research, Mahajan and Wind (1999) completely reversed the earlier view that "only the manager has a clear perspective as to the character and specificity of the information" (Kinnear and Taylor 1979, p. 22) and stated that the "biggest potential in the use of marketing research is … in helping management ask the right strategic questions." They went on to suggest that "marketing researchers need to give it a more central role by connecting it more closely to strategy processes and information technology initiatives" (Mahajan and Wind 1999, pp. 11–12).
Herb Baum, the president and CEO of Hasbro, echoed this view: "[Market research] could improve productivity, if the department were to run with the permission, so to speak, to initiate projects rather than be order takers. . . I think they [market researchers] would be more productive if they were more a part of the total process, as opposed to being called in cafeteria style to do a project." (Fellman, 1999).

But how do we reclaim our relevance to the marketing decision process? Before we answer this question, let's ask ourselves another question: What makes us marketing researchers? There must be an underlying premise to our discipline. There must be a point of view that defines our interests. There must be an underlying theme that motivates us and makes us define ourselves as practitioners of the profession. That underlying theme cannot simply be the collection, analysis, and interpretation of data. This theme has not served us well in the past and has led us to a place where we are already discussing how we can effectively continue to exist as a profession and reestablish our relevance without facing imminent professional extinction. The underlying theme that propels a marketing researcher, it seems to me, should be much more than the collection, analysis, and interpretation of data. I would like to propose that it is the search for marketing knowledge.

FROM DATA TO A CORE BODY OF KNOWLEDGE

The quest for knowledge is not unique to marketing researchers. It is common to all researchers. Many scientific paradigms are based on inductive reasoning, followed by a deductive verification of the hypotheses generated by induction. It is no different in marketing research. In marketing research, we also have the opportunity to follow the scientific process of accumulating data in order to derive lawlike relationships. Marketing researchers seek knowledge at various levels of abstraction. Consider the following marketing questions, which move from very specific to very generalized information:

• How many consumers say that they intend to buy brand X next month?
• How many consumers say that they intend to buy the product category?
• How many of the consumers who say that they intend to buy the brand are likely to do so?
• Can the intention–behavior relationship be generalized for the product category?
• Can it be generalized across all product categories?
• Can we derive any lawlike relationships that will enable us to make predictions about consumer behavior in different contexts?

Clearly, these questions require different degrees of generalization. We go from data to information to lawlike relationships to arrive at knowledge (Ehrenberg and Bound, 2000). Because our discussion is focused on deriving lawlike relationships that lead to knowledge, we review these concepts briefly here.

Information

The term information refers to an understanding of relationships in a limited context. For example, correlational analysis may show that loyal customers of firm X are also more profitable customers. This is information because, though it is more useful than raw data, it has limited applicability. We don't know whether this finding is applicable to other firms or even to the same firm at a different point in time.

Lawlike Relationships

By increasing the conditions of applicability of information, we arrive at lawlike relationships.
In the preceding example, if it can be shown that customer loyalty is related to the profitability of a firm across different product categories and across different geographic regions, we have what is known as a lawlike relationship. The other characteristics of lawlike relationships (Ehrenberg 1982) are that they are

1. General, but not universal. We can establish under what conditions a lawlike relationship holds. Exceptions do not minimize the value of lawlike relationships;
2. Approximate. Absolute precision is not a requirement of lawlike relationships;
3. Broadly descriptive and not necessarily causal. In our example, the lawlike relationship does not say that customer loyalty leads to profits; and
4. Of limited applicability, in that they may not lend themselves to extrapolation and prediction. Lawlike relationships cannot be assumed to hold in all contexts; they have to be verified separately in each context.

Knowledge

Accumulation of lawlike relationships leads to knowledge. In the words of Ehrenberg and Bound (2000, pp. 24–25), "Knowledge implies having a wider context and some growing understanding. For example, under what different circumstances has the attitude–behavior relationship been found to hold; that is, how does it vary according to the circumstances? With knowledge of this kind, we can begin successfully to predict what will happen when circumstances change or when we experimentally and deliberately change the circumstances. This tends to be called 'science,' at least when it involves empirically grounded predictions that are routinely successful."

Unlike in many other disciplines, the gathering of lawlike relationships has been less vigorously pursued in market research. Whenever we are asked by non-research professionals how some marketing variables work (for example, the relationship between advertising expenditure and sales, or the relationship between attitude and behavior), we are painfully reminded how little accumulated knowledge we really have on the subject despite our professional credentials.

"Core Body of Knowledge"

In general usage, the phrase "core body of knowledge" also refers to the skill set a person is expected to have to qualify for the title of market researcher. However, in this paper, the term refers to marketing knowledge derived through the use of marketing research techniques.

Practical Uses of Lawlike Relationships, With Examples

As practitioners of an applied discipline, we can argue that it is the method itself (which may include several aspects such as sampling, statistics, data collection and analysis methods, and mathematical model building in a limited context), not the knowledge generated by the method, that should be of concern to us. Such an argument has some intrinsic validity. However, this view is shortsighted, because it is wasteful and ignores the feedback that knowledge can potentially provide to strengthen the method.

Consider the lawlike relationship that attribute ratings of different brands roughly follow market share. In general, larger brands are rated more highly on practically all brand attributes, and smaller brands are rated low on practically all brand attributes (Ehrenberg, Goodhardt, and Barwise 1990; McPhee 1963). A lawlike relationship such as this will enable the researcher to understand and interpret the data much better. For example, while comparing two brands with very different market shares, the researcher may realize that it may be unproductive to carry out significance tests between the two brands because practically all attributes favor the larger brand. Instead, the researcher may look at mean-corrected scores to assess whether there are meaningful differences, after accounting for the differences between the two brands that may be due to their market share.
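One simple way to operationalize mean-corrected scores is to remove each brand's overall rating level and each attribute's overall level from a brand-by-attribute ratings table, so that what remains highlights attributes on which a brand does better or worse than its overall standing would suggest. The sketch below is one such double-centering illustration with made-up ratings, not a formula prescribed in this paper.

```python
import numpy as np

# Hypothetical mean ratings: rows = brands (large to small), columns = attributes.
ratings = np.array([
    [7.9, 7.6, 7.2],   # large brand: high on nearly everything
    [6.4, 6.1, 6.6],   # mid-sized brand
    [5.2, 5.0, 5.6],   # small brand
])

grand = ratings.mean()
brand_effect = ratings.mean(axis=1, keepdims=True) - grand   # size/share effect
attr_effect = ratings.mean(axis=0, keepdims=True) - grand

# Mean-corrected scores: what is left after brand and attribute levels are removed.
corrected = ratings - brand_effect - attr_effect - grand
print(np.round(corrected, 2))
# Positive cells flag attributes on which a brand over-performs its overall level.
```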
When we use lawlike relationships, our analysis becomes more focused. In the preceding example, we would be more interested in those attributes that do not conform to the lawlike relationship (e.g., the large brand being rated low on a given attribute). Because lawlike relationships are a summary of prior knowledge, using them forces us to take into account what is already known. As another example, if we can uncover a lawlike relationship between advertising and sales in different media, then we can use it to build further models and to simplify and strengthen data analysis. Confirmed theories become prior knowledge when we analyze new data. As these examples show, knowledge can play an important role in marketing research, not just from a theoretical point of view but from a practical point of view as well.

It is also true that the more we know about how marketing variables actually work, the more focused our data collection will be. From a practical point of view, firms do not have to pay for irrelevant data, information that does not in any way contribute to marketing decision making. Braithwaite (1955, p. 1) explains the importance of formulating lawlike relationships this way: "The function of science … is to establish general laws covering the behaviors of the empirical events or objects with which the science in question is concerned, and thereby to enable us to connect together our knowledge of separately known events, and to make reliable predictions of events as yet unknown." We need to connect what we know through "separately known events" to derive knowledge that we can then use to predict what is not yet known. Deriving knowledge and using it to predict (e.g., the success of new products, the relationship between advertising and sales, the relationship between price and quality) seem to be worthy goals for marketing researchers.

Knowledge, therefore, is not a luxury but a powerful tool that can contribute to the collection of relevant data, the avoidance of irrelevant data, and more focused analysis of the data so collected. Because knowledge leads to more relevant data and more focused analysis, we can lower the cost of research while increasing its usefulness to decision making. Knowledge contributes to more relevant and cost-efficient research.

WHY DON'T WE HAVE A CORE BODY OF KNOWLEDGE?

Given all the benefits of having a core body of knowledge, we must ask ourselves why we have failed to develop it, despite the fact that every other science has been steadfastly accumulating lawlike relationships. How is it that even the presence of some of the brightest minds that any profession can boast of, from Alfred Politz to George Gallup, from Hans Zeisel to Andrew Ehrenberg, from Louis Harris to Paul Green, has failed to propel us into rethinking our role in decision making until now?
There are many reasons, including

• The way marketing research has been taught,
• The way marketing research has been perceived over the years,
• The divergent preoccupations of academics (quantitative methods and model building) and practitioners (solving the problem at hand as quickly and cost-effectively as possible),
• Lack of immediate rewards, and
• An over-concern about confidentiality.

All of these reasons, except for the last two, are expounded in different places in this paper. Lack of immediate rewards and over-concern about confidentiality are discussed next.

Lack of Immediate Rewards

To be a market researcher, one does not necessarily have to generalize knowledge. In many other disciplines, this is not the case: specific information is of use only to the extent that it contributes to a more generalized understanding of the observed phenomenon. In marketing research, information collected at a given level of specificity does not need to lead to any generalization for it to be useful. It can be useful at the level at which it is collected. For example, the question, "How many consumers say that they intend to buy brand X next month?" is a legitimate one, even if it does not lead to any generalization about consumers, brands, products, or timelines. It does not even have to be a building block to a higher level of understanding. For marketing researchers, information is not necessarily the gateway to knowledge. Information itself is often the knowledge sought by many marketing researchers. Maybe because of the way we defined ourselves, no one seems to expect us to have a core body of knowledge. If no one expects this of us, if there is considerable work but no commensurate reward for it, why bother? Because there is no tangible reward, there is no immediacy about it either. When there is never immediacy, it is only natural to expect that things will not get accomplished.

Over-Concern About Confidentiality

The bulk of all market research data is paid for by commercial firms that rightfully believe that the data exclusively belong to them and do not want to share them with anyone, especially with their competition. This belief has such intrinsic validity that it clouds the fact that, in most cases, confidentiality is not particularly relevant. Consider data that are a few years old and do not reflect current market conditions. In such cases, the data are of little use to competitors: the data are dated, and the market structure has changed. It is because of such concerns that many firms repeat similar studies year after year. Although old data are of little use to current marketing decision making, they could be of considerable use to researchers trying to identify underlying marketing relationships. Yet, as experience shows, it is extremely difficult to get an organization to release past research data, even when concerns of confidentiality have no basis in fact. The concept of confidentiality is so axiomatic and so completely taken for granted in many businesses that it is not even open for discussion. It is as though the research data come with a permanent, indelible label "confidential" attached to them. A case can be made that non-confidential data held confidential do not promote the well-being of an organization but simply deprive it of the generalized knowledge that can be built on such data.

WHAT TAKES THE PLACE OF KNOWLEDGE?

Our not having a core body of knowledge has led to at least two undesirable developments: models without facts (including "sonking"), and proprietary black box models.

Models Without Facts

Because we have defined ourselves as essentially collectors and interpreters of data, all we are left with are facts. So those who do realize the importance of models, especially in academia, have attempted to build models with sparse data sets and have substituted advanced statistical analysis for empirical observation. It is not uncommon to find research papers that attempt to develop elaborate marketing models based on just a single set of data supported by complex mathematics. Unfortunately, no amount of mathematics, no number of formulas, and no degree of theorizing can compensate for the underlying weakness: lack of data. When we use models without facts, we do not end up with a core body of knowledge but with, as Marder (2000, p. 47) points out, a "premature adoption of the trappings of science without essential substance."

Sonking

A particular variation of models without facts is sonking, or the scientification of non-knowledge. It is the art of building scientific-looking models with no empirical support, so that although they look generalizable, they are not. The use of sonking is sharply illustrated by Ehrenberg (2001): "…one minute the analyst does not know how Advertising (A) causes Sales (S); the next minute his computer has S = 5.39A + 14.56 as the 'best' answer. The label on the least squares SPSS/SAS regression bottle says so …"

Applying multiple regression analysis to a single set of data and using the resultant coefficients to make multimillion-dollar product decisions is an example of sonking. The regression coefficient derived from a single study depends on many factors (e.g., the number and type of variables in the equation, the data points, the presence of collinear variables, and special factors affecting the data set, to name a few). Even if we knew all the variables that could potentially affect the dependent variable, entering all those variables would result in overfitting the model. The problems are further complicated by the fact that in most cases in marketing research multiple regression is applied to survey data (as opposed to experimental data), where there may be other confounding variables. The only way we can establish the validity of the relationship is through a number of replications. In sonking, the inductive method of science is replaced by scientific-looking equations supported by tests of significance, which assure the marketer that the results are valid within a specified margin. Yet tests of significance were never designed to be a shortcut to empirical knowledge.
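The instability of single-study regression coefficients can be illustrated with simulated data. In the sketch below (purely illustrative, not Ehrenberg's example), sales are driven mostly by a promotion variable that happens to be correlated with advertising; a regression that omits the collinear promotion variable attributes its effect to advertising, so the "best" advertising coefficient changes substantially from one specification, and one sample, to the next.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_study(n=60):
    advertising = rng.normal(10, 2, n)
    # Promotion is correlated with advertising and is the real driver of sales.
    promotion = 0.8 * advertising + rng.normal(0, 1, n)
    sales = 2.0 * promotion + 0.5 * advertising + rng.normal(0, 3, n)

    # Model 1: sales on advertising only (promotion omitted).
    X1 = np.column_stack([np.ones(n), advertising])
    b1 = np.linalg.lstsq(X1, sales, rcond=None)[0]

    # Model 2: sales on advertising and promotion.
    X2 = np.column_stack([np.ones(n), advertising, promotion])
    b2 = np.linalg.lstsq(X2, sales, rcond=None)[0]
    return b1[1], b2[1]   # the advertising coefficient under each specification

for study in range(3):
    omitted, controlled = one_study()
    print(f"study {study}: ad coefficient = {omitted:.2f} (promotion omitted), "
          f"{controlled:.2f} (promotion included)")
```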
This is not just a theoretical concern. In a recent JMR paper, Henard and Szymanski (2001) reviewed a number of studies in an attempt to understand why some new products are more successful than others. Their analysis compared correlations between different attributes and product performance obtained in 41 different studies. Some of the correlation ranges are reported in Exhibit 1.
Exhibit 1: Predictors of new product performance (range of correlations obtained in different studies)

                                        Low      High
Product advantage                      -0.31    +0.81
Product innovativeness                 -0.62    +0.81
Technological synergy                  -0.73    +0.68
Likelihood of competitive response     -0.60    +0.05
Competitive response intensity         -0.72    +0.63
Dedicated resources                    -0.19    +1.00
Customer input                         -0.21    +0.81
Senior management support              -0.07    +0.46

What is surprising about this exhibit is that for every single predictor variable, the coefficients are both negative and positive. For instance, how does product advantage affect performance? According to the data, it will sometimes aid product performance and at other times hinder it. And it is not just this variable; it is practically all the variables studied (the exhibit shows only a few variables to illustrate the point). It is difficult to believe that there are no consistent relationships between any of these variables and product performance. There is not even clear directionality! We can conclude one of two things:

1. Each study was done so differently, with different definitions and variables, that it produced no generalizable knowledge. After reviewing 41 studies we know as little about the relationship between the different independent variables and product performance as we ever did; or
2. Different relationships are applicable to different contexts, and we have no idea what those conditions might be.

The problem with a lack of generalized knowledge is that we have no way of checking the validity of new information or even of our own analysis. If, through a programming error, we obtain a negative coefficient instead of a positive one, we have nothing in our repertoire that will alert us to this potentially harmful error. Anything is possible, and this leads to further sonking. In the absence of empirical knowledge, who is to say something is nonsensical (as long as it is "statistically significant")? Many analysts seem to be willing to analyze data with purely statistical methods. Colombo et al. (2000) asked 21 experts to analyze a set of brand-switching tables. All analysts were provided the same set of questions. These 21 experts collectively used 18 different techniques to understand a set of straightforward contingency tables. Their conclusions did not converge either. This is not meant to be critical of the analysts, but to point out that without prior knowledge, without an idea of what to look for, even experts cannot provide us with an answer whose validity we can be comfortable with.

Market researchers who are serious about developing a core body of knowledge do not, and should not, believe in premature theorizing and model building. Unfortunately, as Marder (2000) points out, academics, who mainly develop theories and models, do not have the vast resources needed to test them; practitioners, who have access to data, are not necessarily interested in theory building.

As it stands, we have the two solitudes: facts without models and models without facts. However, the basis of all sciences is empirical data. If our models and theories are not supported by extensive empirical data, then we do not have a core body of knowledge.

BLACK BOX MODELS

Another outcome of the lack of a strong core body of knowledge is the proliferation of black box models. These are "proprietary" models developed by commercial vendors that are claimed to predict the success of different marketing activities.
For example, there are black box models that purport to predict future sales volumes of new products or to forecast the success of a commercial yet to be aired. Because the mechanics of the models are not revealed to the buyer, the buyer necessarily has to rely on the persuasion skills and promises of the vendor. Black box models do not allow buyers to evaluate for themselves the reasonableness and correctness of the assumptions and procedures involved. Black box models implicitly claim to have uncovered some lawlike relationship or a precise way of identifying how (and which) independent variables relate to a dependent variable. Yet such models are of unknown validity and are suspect, because "the necessarily simple concepts behind a good method can hardly be kept a secret for long" (Ehrenberg and Bound 2000, p. 40). Claims of precise predictability by proponents of black box models can at times stretch one's credulity. If such precise knowledge were possible, it is only reasonable to assume that it would not have eluded the thousands of other researchers who work in this rather narrow field of inquiry. We can of course never be sure, because there is no way to subject these models to scientific scrutiny. If proponents of a model cannot prove its validity, neither can we demonstrate its lack of validity. This leads us to our next point.

We are uncomfortable with black box models not necessarily because there may be less to them than meets the eye, but because they lack the hallmark of scientific models: refutability (Popper 1992b). As Kuhn (1962) argues, even science is not protected from error. One main reason the scientific method is accepted in all disciplines is that science is self-correcting. Its models are open, and anyone can refute them with the use of logic or empirical evidence. Any knowledge worth having is worth scrutinizing. The opaqueness of black box models makes them impervious to objective scrutiny. Consequently, in spite of their scientific aura and claims to proprietary knowledge, black box models contribute little to our core body of knowledge. Unfortunately, the less we know about marketing processes, the more marketers will be dependent on black boxes of unknown validity.

A SLOW MARCH TOWARD A CORE BODY OF KNOWLEDGE

In a way, perhaps we have known all along that we need a core body of knowledge. We can assume that academics who attempted to build models, though they did not have access to large volumes of data, did so in an attempt to develop a core body of knowledge. But not enough has been done, and we continue to lack a core body of knowledge.

In marketing research, lawlike relationships are orphans. Applied researchers tend to concentrate on gathering and interpreting information, whereas academics tend to concentrate on methodology and techniques. Academics are mainly concerned with the "how well" (techniques and methods), whereas applied researchers are mainly concerned with the "how to" (implementation and interpretation). Academics act as enablers, whereas applied researchers use the techniques to solve day-to-day problems. But, as we have been discussing, a mature discipline should be more than a collection of techniques and an agglomeration of facts. It should lead to generalizable observations. It should lead to knowledge. We can think of knowledge as our discipline. Not just information, not just techniques.
Information and the skills imparted through techniques are not only ends in themselves (though they often can be), but also means to an end. That end is knowledge, and therefore knowledge is our discipline. Marketing research can be thought of as a collection of techniques and an approach to solving problems, as well as a means of uncovering lawlike relationships, which is our final goal. From an overall perspective, we need to merge facts with models, theory with practice. Models that cannot be empirically verified have no place in an applied discipline like marketing research. To arrive at knowledge, we convert data into information and information into lawlike relationships, and we merge empirical facts with verifiable theory.

I think the last point is worth emphasizing. Marketing research should be more than just an approach to solving problems—it should result in the uncovering of lawlike relationships in marketing. It should not be just a collection of tools—it should also be what these tools have produced over a period of time that is of enduring value. Market researchers should be able to say not only that they have the tools to solve a variety of problems, but also that they have developed a core body of knowledge using these tools. In short, marketing research is not just a "means" discipline, but an "ends" discipline as well.

KNOWLEDGE AS OUR DISCIPLINE

Developing lawlike relationships and creating a core body of knowledge are marketing research's contribution to the understanding of marketing as a process. We cannot overemphasize that marketing research does not exist solely to provide information to decision makers, but also to develop a core body of marketing knowledge. Marketing research does not simply provide input to decision makers, but is also a part of the decision-making process. Marketing research is not peripheral to, but an integral part of, marketing decision making. To treat it otherwise impoverishes both marketing and marketing research.

To develop a core body of knowledge, we need to reexamine and perhaps discard many currently held beliefs, such as
• The sole purpose of marketing research is to collect and analyze data,
• Marketing researchers cannot directly participate in the decision-making process,
• It is acceptable for decision makers to have implicit faith in models that cannot be put through transparent validity checks (proprietary black boxes),
• Marketing researchers can be an effective part of the decision-making process without possessing a core body of knowledge,
• All data are forever confidential, and
• Lawlike relationships can be uncovered through the sheer strength of mathematical and statistical techniques without our first having to build a strong empirical base.

I believe that we need to work consciously toward creating a core body of knowledge. We need to deliberately share information and data. We need to discourage secret knowledge, sonking, and unwarranted confidentiality. We need not be content with just immediate solutions to marketing problems. For those of us who have long believed that marketing research is more than a glorified clerical function, it has been obvious that a substantial part of marketing research should be concerned with developing knowledge that contributes to our understanding of marketing as a process. For anyone who accepts this premise, it is self-evident that knowledge is our discipline.
If it had not been so in the past, it should be so in the future, if we are to fulfill our promise.

BIBLIOGRAPHY

American Marketing Association (1961), Report of Definitions Committee of the American Marketing Association. Chicago: American Marketing Association.
——— (2000), "The Leadership Imperative," Attitude and Behavioral Research Conference, Phoenix, AZ (January 23–26).
Braithwaite, R. (1955), Scientific Explanation. Cambridge: Cambridge University Press.
Colombo, Richard, Andrew Ehrenberg, and Darius Sabavala (2000), "Diversity in Analyzing Brand-Switching Tables: The Car Challenge," Canadian Journal of Marketing Research, 19, 23–36.
Cox, Keith K. and Ben M. Evans (1972), The Marketing Research Process. Santa Monica, CA: Goodyear Publishing Company.
Ehrenberg, A.S.C. (1982), A Primer in Data Reduction. Chichester, England: John Wiley & Sons.
Ehrenberg, A.S.C. (2001), "Marketing: Romantic or Realist," Marketing Research, (Summer), 40–42.
——— and John Bound (2000), "Turning Data into Knowledge," in Marketing Research: State-of-the-Art Perspectives, Chuck Chakrapani, ed. Chicago: American Marketing Association, 23–46.
———, G.J. Goodhardt, and T.P. Barwise (1990), "Double Jeopardy Revisited," Journal of Marketing, 54 (3), 82–89.
Fellman, Michelle Wirth (1999), "Marketing Research is 'Critical'," Marketing Research: A Magazine of Management & Applications, 11 (3), 4–5.
Financial Times (1997), The Complete MBA Companion. London: Pitman Publishing.
——— (1999), Mastering Marketing. London: Pitman Publishing.
Gibson, Larry (2000), "Quo Vadis, Marketing Research?" working paper, Eric Marder Associates, New York.
Henard, David H. and David M. Szymanski (2001), "Why Some New Products Are More Successful Than Others," Journal of Marketing Research, 38 (August), 362–375.
Kinnear, Thomas C. and James Taylor (1979), Marketing Research: An Applied Approach. New York: McGraw-Hill.
Kuhn, Thomas (1962), The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Mahajan, Vijay and Jerry Wind (1999), "Rx for Marketing Research," Marketing Research: A Magazine of Management & Applications, 11 (3), 7–14.
Marder, Eric (2000), "At the Threshold of Science," in Marketing Research: State-of-the-Art Perspectives, Chuck Chakrapani, ed. Chicago: American Marketing Association, 47–71.
Marketing News (1987), "New Marketing Definition Approved," (January 2), 1, 14.
McPhee, William N. (1963), Formal Theories of Mass Behaviour. Glencoe, NY: The Free Press.
Popper, Karl (1992a [reprint]), The Logic of Scientific Discovery. London: Routledge.
——— (1992b [reprint]), Conjectures and Refutations: The Growth of Scientific Knowledge, 5th ed. London: Routledge.

PARADATA: A TOOL FOR QUALITY IN INTERNET INTERVIEWING

Ray Poynter, The Future Place
Deb Duncan, Millward Brown IntelliQuest

What is the biggest difference between Internet interviewing and face-to-face interviewing? What is the biggest difference between Internet interviewing and CATI? On the face of it, the biggest differences might appear to be the presence of new technologies, the presence of an international medium, the presence of a 24/7 resource. However, the biggest difference is the absence of the interviewer. Historically, the interviewer has acted as the ears and eyes of the researchers.
The interviewer has also operated as an advocate for the project, for example by persuading respondents to start, and then to persevere and finish, surveys. If something is going awry with the research, there is a good chance that the interviewer will notice it and alert us. All of this is lost when the interviewer is replaced with the world's biggest machine, namely the Web.

Although we have lost the interviewer, we have gained an enhanced ability to measure and monitor the process. This measuring and monitoring is termed Paradata. Paradata is one of the newest disciplines of market research, although like most new ideas it includes and reflects many activities that have been around for a long time. Paradata is data about the process; for example, Paradata includes interview length, number of keystrokes, details about the browser, and details about the user's Internet settings. The introduction of the term Paradata is credited to Mick Couper of the Institute for Social Research, University of Michigan. His initial work (1994 and 1997) was conducted with CAPI systems, looking at the keystrokes of interviewers. This measuring and monitoring, this Paradata, enables the researcher to put some intelligence back into the system, to regain control, and to improve quality.

PARADATA: A TOOL FOR EXPLORATION

The first person to popularize Paradata as a means of understanding more about the online interviewing process and experience was Andrew Jeavons (1999). Jeavons analyzed a large number of server log files to see what was happening. He noted the number of errors, backtracks, corrections, and abandoned interviews. He discovered that higher levels of mistakes and confusion were associated with higher rates of abandoned interviews (not a surprising outcome). In particular, he found that grid questions, and questions where respondents had to type in numbers that summed to some figure, caused more errors, corrections, and abandoned interviews. This finding led Jeavons to advise against these types of questions, or at least to avoid correcting the respondent when they failed to fill in the questions in the pattern desired by the researcher.

Jeavons took this analysis further in his 2001 paper, which associated the term Paradata with his explorations and identified a number of phenomena that occur in interviews. One such phenomenon is cruising, where the respondent adopts some strategy to complete the interview quickly, for example always selecting the first option in any list. In this paper Jeavons also started to explore how problems could be identified. Jeavons identified three uses of Paradata:

• Questionnaire Optimization — Paradata can be used to help us write the questions in the best way possible. Confusion, delays, and missing data all indicate that the questionnaire could be improved.
• Quality Control — Paradata can help us identify cases where respondents are making mistakes (e.g. sums not adding to a specified figure), and cases where respondents may be short-cutting the interview (e.g. selecting the first item from each list), a process Jeavons terms cruising.
• Adaptive Scripting — Jeavons raises the prospect of using Paradata to facilitate adaptive scripting. For example, we might consider asking respondents with faster connections and faster response times more questions.
Borrowing from adaptive approaches such as ACA, it may be possible to route respondents to the sort of interview that best suits their aptitudes.

Paradata with CAPI and CATI

With 20/20 hindsight we can re-classify many aspects of CAPI and CATI as Paradata. Key CAPI items include the date of data collection and the length of the interview; with many systems the length of time spent on each question is also recorded. Advanced systems such as Sawtooth Software's ACA capture statistics such as the correlation between the scores on calibration concepts and the conjoint utilities – this coefficient provides an estimate of how well the interview process has captured the person's values. CATI systems collect data about the respondent's interaction with the interview (e.g. date, length, etc.) and also about the interviewer's interaction with the software. Paradata elements that relate to the interviewer include statistics about inter-call gaps, rejection rates, probing, and a wide variety of QA characteristics.

The main difference between CAPI/CATI Paradata and the growing practice of Internet Paradata is accessibility. In the CAPI/CATI context most Paradata is never seen beyond the programming or DP department. By contrast, researchers have shown a growing interest in the information that can be unleashed by the use of Paradata.

Optimization in Practice

At the moment the main practical role for Paradata is in optimizing questionnaires. This optimization process can be conducted in an ongoing fashion, using Paradata to define and refine ongoing guidelines. It can also be used on a project-by-project basis, to optimize individual studies.

For example, after 30 interviews the Paradata can be examined to consider:
• how long the interview took
• how long people thought the interview took
• how the open-ends are working
• how many questions have missing data
• how many interviews were abandoned
• user comments

When this approach is adopted it is important that the analysis is carried out on all the people who reached the questionnaire, not just those who completed it! Try to identify whether the system crashed for any respondents. From this Paradata you can identify whether any emergency corrections need to be made. Often this may be as simple as improving the instructions; on other occasions it may mean more significant alterations. (A minimal sketch of such an early-fieldwork check appears below, ahead of the case-study results.)

Tail-End Checks for Optimization

As part of the Paradata process it is useful to add questions at the end of the survey. Note that these check questions can be asked of a sub-set of the respondents; there is no need to ask everybody. For example, a survey could have three tail-end questions, with each respondent being asked just one of them. A suitable configuration would be:
• an open-ended question asking about the overall impression of the interview;
• a five-point satisfaction scale with the research;
• a question asking how long the respondent felt the survey took to complete.

A good interview is one where people underestimate how long the survey took.

AN OLYMPIC CASE STUDY

In 2000 Millward Brown UK conducted a number of online and offline research projects thematically connected with the Olympics. These tests were funded by Millward Brown and allowed a range of quantitative and qualitative techniques to be compared in the context of a single event. This section looks at one of the quantitative projects and the learning that was acquired.
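Before turning to the results, here is a minimal sketch of the kind of early-fieldwork check described above. It assumes a simple log of one record per respondent, with the furthest question reached and per-question timings; the field names and the data are hypothetical, not the structures used by Millward Brown.

    # Minimal sketch of an early-fieldwork Paradata check (hypothetical data layout).
    # Each record: respondent id, furthest question reached, seconds spent per question.
    from statistics import median

    responses = [
        {"id": 1, "last_q": 14, "timings": {1: 8, 2: 41, 3: 9}},
        {"id": 2, "last_q": 2,  "timings": {1: 7, 2: 65}},   # abandoned at Q2
        {"id": 3, "last_q": 14, "timings": {1: 6, 2: 38, 3: 11}},
    ]
    TOTAL_QUESTIONS = 14

    def dropout_by_question(records, n_questions):
        """For each question, count how many respondents reached it and how many stopped there."""
        reached = {q: 0 for q in range(1, n_questions + 1)}
        stopped = {q: 0 for q in range(1, n_questions + 1)}
        for r in records:
            for q in range(1, r["last_q"] + 1):
                reached[q] += 1
            if r["last_q"] < n_questions:
                stopped[r["last_q"]] += 1
        return {q: {"reached": reached[q],
                    "dropped_here": stopped[q],
                    "drop_rate": stopped[q] / reached[q] if reached[q] else 0.0}
                for q in range(1, n_questions + 1)}

    def median_time_per_question(records):
        """Median seconds per question; unusually slow questions are candidates for rewording."""
        times = {}
        for r in records:
            for q, secs in r["timings"].items():
                times.setdefault(q, []).append(secs)
        return {q: median(v) for q, v in sorted(times.items())}

    print(dropout_by_question(responses, TOTAL_QUESTIONS))
    print(median_time_per_question(responses))

Flagging questions with unusually high drop rates or long median times is exactly the kind of evidence used in the Olympic example that follows.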
The survey was hosted on the Web and respondents were invited to take part by e-mail, shortly after the Olympics finished.

368 people reached the first question: "Thinking about the Olympic games, which of the following statements best describes your level of interest in the Olympic games?" (a closed question with 5 breaks)
• 18 people dropped out at this stage, which was 27% of all dropouts and 4% of those who were offered the question.
• This is a very typical pattern; in most studies a number of people will visit the survey to see what it is like and then drop out at the first question.

350 people reached the second question: "When you think about the Olympic games what thoughts, feelings, memories and images come to mind?" (an open-ended question)
• The key feature of this question is that it is very 'touchy feely', and in this test it was compulsory (i.e. the respondent had to type something).
• 23 people dropped out, 38% of all those who dropped out, 7% of those who reached the question.
• This level of dropout on the second question is much higher than we would expect, and would appear to be due to asking a softer open-end, near the beginning, in a compulsory way.

327 people reached the third question: "How would you rate this year's Olympic games in terms of your personal enjoyment of the games?" (a closed question with 5 breaks)
• Just 3 people dropped out, 5% of all dropouts, and a dropout rate of 1%.
• This is what we would expect to see.

324 people reached the fourth question: "What stands out in your mind from this year's Olympic games? Please tell us why it stands out for you?" (an open-ended question, again non-trivial, again compulsory)
• 10 people dropped out, which was 17% of all dropouts, and a dropout rate of 3%.
• This result appears to confirm that it is the softer, compulsory open-ends that are causing the higher dropout rates.

314 people reached the fifth question, which was another closed question.
• Just 3 people dropped out.
• Over the remaining nine questions only 3 more respondents dropped out.

Previous research by us and others had identified that simple, directed open-ends work very well in Internet interviewing. These Olympic results suggest that we should try to avoid the more open style of question in favor of directed questions. If the softer, more open type of question is needed, then it should appear later in the interview, and it should not be compulsory – if that is possible within the remit of the project.

This study also looked at response rates and incentives. The original invitations were divided into three groups: a prize draw of £100, a donation of £1 per completed interview to charity, and one group with no incentive.

Response rates
Prize draw         6.7%
Charity donation   5.5%
No incentive       6.3%

The data suggested that small incentives did not work well. Other research has suggested that larger incentives do produce a statistically significant effect, but not necessarily a commercially significant one.

CIMS – A CASE STUDY

Millward Brown IntelliQuest's Computer Industry Media Study (CIMS) is a major annual survey that measures US readership of computer and non-computer publications as well as the viewing of broadcast media. Prior to 1999 CIMS employed a two-phase interviewing procedure: telephone screening qualified business and household computer purchase decision-makers, to whom Millward Brown IntelliQuest then mailed a survey packet.
While part of CIMS was administered on a floppy disk, the media section had remained on paper to accommodate a large publication list with graphical logos. Millward Brown IntelliQuest conducted a large-scale study in the summer of 1999 to explore the feasibility of data collection via the Internet and to compare the response rates and survey results obtained using three alternative Web-based questionnaires and the customary paper questionnaire.

We obtained a sample of technology influencers from a database of those who had registered a computer or computer-related product in the past year. IQ2.net, a database marketing firm that until recently was owned by Millward Brown IntelliQuest, provided a list of 27,848 records selected at random from their computer software and hardware registration database. These people were then interviewed by phone to find those who were available, who had email and Internet access, and who were prepared to take the survey. In total 2,760 people were recruited in the time available. Those who agreed to cooperate were then assigned at random to receive one of four questionnaire versions, yielding approximately 690 completed phone interviews per version. One cell used the pre-existing paper questionnaire; the other three cells were assigned to three different Web approaches.

Regardless of whether the questionnaire was Web-based or paper-based, the respondent had to supply an email address during phone screening. When we attempted to contact these respondents electronically, 18% of their email addresses turned out to be invalid. In order to ensure comparable samples, all recipients of the paper questionnaire were also sent an email to verify their email address. Those who failed such validation were removed from the respondent base, so that all four groups were known to have valid email addresses.

Questionnaire 1: The Paper Format

The paper-based questionnaire represented the current version of the CIMS media measurement component. The questionnaire began with an evaluation of 94 publications, followed by television viewership, product ownership and usage, and concluded with demographic profiling. The format used to evaluate readership is shown in Figure 1.

Figure 1: Paper Format

Given the onerous task of filling out the readership section (four questions by 94 publications), we experimented with three alternative designs to determine the impact on response rate and readership estimates.

Questionnaire 2: The Horizontal Web Format

The horizontal Web version represented the closest possible visual representation of the paper-based media questionnaire. Respondents viewed all four readership questions across the screen for each publication. Respondents had to provide a "yes" or "no" to the six-month screening question, and were expected to provide reading frequency and two qualitative evaluations (referred to collectively as "follow-up questions") for each publication screened in. An example of the Web-based horizontal version appears in Figure 2.

Due to design considerations inherent to the Web modality, the horizontal Web version differed from the paper questionnaire in a few ways. First, only 7 publications appeared on screen at a time, compared with the 21 shown on each page of the paper version. Second, the horizontal Web version did not allow inconsistent or incomplete responses, as the paper version did.
This means that the follow-up questions could not be left blank, even in the instance where a respondent claimed not to have read the publication in the past six months!

Figure 2: Horizontal Web Format

Questionnaire 3: The Modified Horizontal Web Format

The modified horizontal Web version was identical to the horizontal version except that it assumed an answer of "no" for the six-month screen. This allowed respondents to move past publications they had not read in the last six months, similar to the way respondents typically fill out paper questionnaires. As in the Web-based horizontal version, only seven publications appeared on each screen.

Questionnaire 4: The Vertical Web Format

The vertical Web version differed most from the original paper-based horizontal format. Respondents were first shown a six-month screen, using only black-and-white logo reproductions. After all 94 publications were screened, respondents received follow-up questions (frequency of reading and the two qualitative questions) for those titles screened in. It was assumed that hiding the presence of the follow-up questions on the Web would lead to higher average screen-ins, similar to the findings of Appel and Pinnell (1995) using a disk-based format and Bain et al. (1997) using Computer-Assisted Self-Interviewing. An example of the vertical format is given in Figure 3.

Figure 3: The Vertical Web Format

Response Rates

The initial response rates ranged from 54% for the paper format to 37% for the horizontal Web format. Among the Web formats, the values ranged from 48% for the vertical format down to 37% for the horizontal format. These data are shown in Figure 4.

Figure 4: The Response Rates for the Four Cells (completed plus incomplete interviews: Paper 54%; Web Vertical 48% completed plus 4% incomplete; Web Modified Horizontal 40% plus 11%; Web Horizontal 37% plus 11%)

However, when we inspected how many people commenced each type of interview, a different pattern emerged. For example, 11% of those invited to complete the Web Horizontal started the interview but did not complete it (a drop-out rate of about 23%). When we add the incomplete interviews to the completed interviews, as in Figure 4, we see that the initial response rate for the three Web formats was very similar. (A small worked example of this arithmetic appears after Figure 5.)

Publications Read

When we reviewed the number of publications that respondents claimed to have read (Figure 5), we see that the vertical Web format elicited significantly higher numbers than any of the other formats. The combination of seeing the titles in groups of 7, and of not seeing the 'penalty' in terms of additional questions, produced much higher recognition figures. This was something the smaller-circulation publications liked, but it was clearly an artifact of the design, discovered by the Paradata analysis process.

Figure 5: Publications Claimed (mean number of publications: Paper 8.6**; Web Vertical 12.2*; Web Modified Horizontal 10.0*; Web Horizontal 7.9**)
* Significantly different from all others at 95% confidence.
** Significantly different from Web Vertical and Web Horizontal at 95% confidence.
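The drop-out arithmetic in the Response Rates subsection can be made concrete with a small worked example. The counts below are hypothetical (the paper reports only percentages); they are chosen to roughly match the reported rates for the Web Horizontal cell.

    # Hypothetical counts illustrating the completion/drop-out arithmetic reported above.
    invited = 690                                   # assumed invitations in the cell
    completed = round(0.37 * invited)               # Web Horizontal: 37% completed
    started_not_finished = round(0.11 * invited)    # 11% started but did not complete

    started = completed + started_not_finished
    initial_response_rate = started / invited       # "commenced" rate
    dropout_rate = started_not_finished / started   # share of starters who abandoned

    print(f"Started: {started} ({initial_response_rate:.0%} of invited)")
    print(f"Drop-out among starters: {dropout_rate:.0%}")   # roughly 23%, as reported

With these assumed counts, 48% of invitees commence the interview and about 23% of those starters abandon it, matching the figures quoted in the text.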
B2B Probing

Millward Brown IntelliQuest conducts a great many B2B online interviews, and there is always great interest in how far the boundaries can be pushed in terms of utilizing the latest facilities, for example Flash, Java, and rich media. The following data are a small example of what was obtained by querying the browsers of respondents who were completing a panel interview in early 2001; the countries were the UK, France, and Germany.

Operating systems
Apple           2%
Windows 95     14%
Windows 98     43%
Windows NT     41%
UNIX          0.2%

Browsers
IE 4            9%
IE 5           54%
IE 5.5         27%
Other/DK       10%

These data contrasted strongly with our consumer data, where Windows NT was hardly present. The data allowed us to optimize settings and to demonstrate that there were not enough UNIX users to segment.

Invisible Processing

Invisible processing is data collected about the respondent without asking direct questions. For example, we can use cookies to identify how often somebody visits a site. We can query their browser to find out their screen settings. We can time them to see how long the interview took to complete. Not all invisible processing is Paradata (for example, the web site data collected by tracking companies such as Engage is primary data collection, not Paradata). Not all Paradata is invisible, e.g. asking people what they thought about the interview. Nevertheless, there is a large area of overlap between invisible processing and Paradata, and most of the guidelines and ethical considerations about Paradata stem from observations about invisible processing. Figure 6 shows an example of the sort of information that is available from the respondent's browser, without asking the respondent an explicit question.

Figure 6: Example of information available from the respondent's browser

Paradata and Ethics

The overarching principle of 21st century research is informed consent. If you are collecting invisible data you should be informing respondents. The normal practice is to include a short note in the survey header with a link to a fuller privacy policy statement. It is not possible to list all of the invisible processing that might happen, since the researcher will not normally be aware of all of the options. Therefore, the privacy policy should highlight the broad types of data that are collected and how they will be used. If cookies are being used, they should be declared, along with their type and longevity.

If you will not be linking the Paradata to individual data, you should say so. If, however, you plan to use Paradata to identify respondents who are considered to be cheating or to be lacking in proper care, and to remove them from the survey and the incentive, you should say so. This only needs to be done at a general, indicative level: for example, that interview metrics will be used to evaluate whether the interview appears to have been completed correctly, and that failure to fall within these metrics may result in your data being removed from the survey and you being excluded from the incentive. For example, researchers will often exclude an interview that was completed too quickly, but it would not be advisable to warn the respondent of the specifics of this quality check.

CONCLUSIONS

The key observation is that all projects are experiments; all projects provide data that allow us to improve our understanding and our ability to improve future research.
Millward Brown has found that Paradata allows us to better understand the way our research instruments are performing, and it is an additional tool in the quality assurance program. Amongst other uses, we find Paradata useful in:
• Optimizing questionnaires
• Avoiding rogue interviews
• Minimizing unwanted questionnaire effects
• Maximizing opportunities.

REFERENCES

Appel, V. and Pinnell, J. (1995). How Computerized Interviewing Eliminates the Screen-In Bias of Follow-Up Questions. Proceedings of the Worldwide Readership Research Symposium 7, p. 117.
Bain, J., Arpin, D., and Appel, V. (1995). Using CASI-Audio (Computer-Assisted Self-Interview with Audio) in Readership Measurements. Proceedings of the Worldwide Readership Research Symposium 7, p. 21.
Couper, M.P., Sadosky, S.A., and Hansen, S.E. (1994). "Measuring Interviewer Behaviour Using CAPI." Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 845-850.
Couper, M.P., Hansen, S.E., and Sadosky, S.A. (1997). "Evaluating Interviewer Use of CAPI Technology," in Survey Measurement and Process Quality, edited by Lyberg, L. et al. Wiley, pp. 267-285.
Jeavons, Andrew (1999). "Ethology and the Web: Observing Respondent Behaviour in Web Surveys." Proceedings of the ESOMAR Worldwide Internet Conference Net Effects 2, London.
Jeavons, Andrew (2001). "Paradata: Concepts and Applications." Proceedings of the ESOMAR Worldwide Internet Conference Net Effects 4, Barcelona.

WEB INTERVIEWING: WHERE ARE WE IN 2001?

Craig V. King & Patrick Delana
POPULUS

POPULUS has been conducting web-based data collection since early 1999. To date we have collected approximately 230,000 completed interviews, with the vast majority consisting of employee attitudinal and behavioral (adaptive conjoint analysis; ACA) studies. Response rates across all companies involved in the employee-based research (n ~130) average 51%, with a range of 19% - 93%. Sample sizes for employee-based surveys vary from 140 to ~83,000. These surveys have also been translated into six foreign languages: Spanish, German, French, Portuguese, Dutch, and Japanese.

RESPONSE RATES

One reason response rates tend to be high in these employee studies is the company-wide support of the research initiative. Organizations that participate in the research notify the affected employees that the research will be taking place, identify the vendor, and communicate support for the research. The organization provides the vendor a spreadsheet containing names, email addresses, and other relevant information. We then send out an email invitation to participate in the research. Key elements of the invitation include the URL – generally as a hyperlink – and a password. The invitations are sent using a plain-text format, not HTML, and they do not include any attachments, because some email systems do not have the capability to manage attachments. In most instances, a reminder notification is sent 7 – 21 days following the initial notification. Reminders are effective in maximizing response rates and are highly recommended (King & Delana, 2001). The reminder notifications include the URL, the individual's password, and two notifications of the deadline.

Non-employee research on the Web has primarily consisted of targeted business and consumer-based research projects.
We have conducted several studies in different industries to evaluate product attributes, preference, need, and purchase intent, using both conjoint and non-conjoint approaches. In one study, assessing companies' desire for enhanced product features, a telephone list was generated targeting businesses of specific sizes and industry types. Telephone calls screened potential respondents and then assessed interest in completing the research. Email addresses were captured with the understanding that respondents would receive an invitation to participate in the research via email.

Response rates for this study varied depending on the incentive offered. Initially, respondents who agreed over the telephone to participate in the research were offered a charity donation in their name. For this incentive, there was a 20% response rate among people who had agreed to participate in the survey and who had provided their email address to the telephone interviewer. Response rates increased when different incentives were offered. Specifically, when electronic cash redeemable at several on-line businesses was offered, response rates increased to approximately 35%.

In an adaptive conjoint study targeting IT professionals, we sent invitations through purchased email lists of IT professionals. Response rates varied from 0.5% to 3% depending on the source of the list. Company-supplied email lists generally result in much better response rates to Web surveys. On average, we have seen about a 35% response rate when we have email addresses of current customers supplied by the client.

SAMPLING

Internet-based research has several advantages over traditional research methods (Miller, 2001). One of the main advantages is the speed at which data can be collected. For a consumer-based survey, a client could easily obtain all necessary respondents over a weekend: emails sent to a panel or a list can go out on Friday, with all respondents completing the survey by Monday. Another advantage is the complete lack of interviewer bias; a computer lacks any personal attributes that might affect response patterns in a systematic way. The cost of Internet data collection has been touted as being cheaper than any other methodology. In many instances, especially when a good list of email addresses is readily available, the cost per interview for Web data collection is very low. However, in studies where email addresses are not easily obtained, a traditional telephone survey may actually be cheaper. If the survey is brief and the topic is of interest, it is probably less expensive to collect data over the telephone than it would be to collect it over the Internet.

The biggest disadvantage of Web data collection is that the Web sample may not be representative of the population of interest. Approximately 50% of households have Internet access, while approximately 70% of the US population has access to the Internet. If people who do not have Internet access are potential or current customers, there is a possibility they will behave differently than people with Internet access. While weighting procedures may be developed to adjust for differences, it is difficult to determine the exact algorithm needed to effectively weight for non-response due to lack of Web access. Another challenge in collecting data via the Web has to do with obtaining email addresses.
Email addresses are difficult to obtain, and there is no set protocol for email addresses as there is for telephone numbers. If a list is not readily available from the client, access to opt-in lists is an option; however, response rates may be very low. A panel of respondents solves some of these problems, but panels are expensive to develop and maintain.

INCENTIVES

While there was some discussion of incentives earlier in the paper, it is important to expand on that discussion in more detail. We have seen improved response rates from the same sample list when higher amounts of cash were offered. Specifically, we slowly increased the incentive from a drawing to a fixed dollar amount for each respondent. When dollar amounts exceeded $10, response rates increased dramatically. The disadvantage of offering cash is that it can quickly become very expensive to pay all respondents for their time. Additionally, it is expensive and time-consuming to process many individual respondent payments. Drawings for cash and prizes may or may not be an effective incentive. The challenge is to determine the optimal amount of reward that will be effective without spending too much money. Drawings may not be as effective for business-based research as they are for consumer research.

TECHNICAL CONSIDERATIONS

There are several factors to consider when developing a survey for the Web. The variability in computer configurations is large, and issues such as screen size and resolution should always be considered when developing a survey. Respondents with low resolution or small monitors may have difficulty viewing the entire question or set of responses without scrolling vertically or horizontally. Anecdotal evidence suggests that respondents will quickly become frustrated with the survey if they are required to constantly scroll in order to view the questions. Another limitation is related to Web browsers: certain browsers do not support all possible Web interviewing features. We recommend that researchers carefully review the survey on several different computers with various browsers and screen resolutions to ensure that the desired visual effect is preserved across the various configurations. Many home computer users connect to the Web with a 56k modem. These connections are slower than DSL or T1 connections and can become very slow when the survey contains more than simple text. Weighing the balance between design and respondent capabilities is important.

Many people simply do not have Web access and thus are not viable candidates for a Web interview. Others, who might have Web access, may not have the technical skills needed to complete a survey on-line. While people are becoming increasingly proficient with computers, there are many who still do not understand some basic features, such as scroll bars and "next page" buttons. We recommend that respondents be provided easy access to a "FAQ" page to assist with simple technical problems. Our surveys contain a FAQ page link in the footer of each page.

Another technical consideration has to do with user access to the survey. If a researcher is inviting respondents to take an on-line survey, we recommend using passwords that control access to the survey. Password access ensures that only targeted respondents complete the survey. Additionally, this approach prevents respondents from accessing the survey more than once, thereby increasing the validity of the responses.
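One way to implement such password control is to generate each respondent's password from an alphabet that omits easily confused characters, a point expanded on in the next paragraph. A minimal sketch follows; the alphabet and length are chosen for illustration and are not taken from the paper.

    # Minimal sketch of respondent password generation (illustrative alphabet and length,
    # not the scheme described in the paper). Omits characters that are easily confused:
    # 1/I/l, 0/O, and v/w (since "vv" can be mistaken for "w").
    import secrets
    import string

    AMBIGUOUS = set("1Il0Ovw")
    ALPHABET = [c for c in string.ascii_letters + string.digits if c not in AMBIGUOUS]

    def make_password(length=8):
        """Return a random password drawn from the unambiguous alphabet."""
        return "".join(secrets.choice(ALPHABET) for _ in range(length))

    def make_passwords(n, length=8):
        """Generate n unique passwords, e.g. one per invited respondent."""
        passwords = set()
        while len(passwords) < n:
            passwords.add(make_password(length))
        return sorted(passwords)

    print(make_passwords(5))

Each generated password would then be merged into the invitation email alongside the survey URL, so that access can be matched one-to-one with the invited respondent list.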
However, when assigning passwords it is critical to keep a few general rules in mind. Mixing alpha and numeric characters increases the difficulty of unauthorized access; however, it does pose additional problems. Specifically, characters such as the numeral "1", the upper case letter "I" (i), and the lower case letter "l" (L) appear very similar and are easily confused. Other potential problems include the use of the numeral "0" (zero) and the upper case letter "O", as well as the lower case letter "w" and two lower case letters "vv" (v + v), in passwords.

We recommend using automated on-line technical support programs as well. All our surveys have an email address to request help; each person who sends an email receives an automated response with the information contained in the FAQ page as well as login instructions.

MEASUREMENT

It has been demonstrated that methodology does affect response patterns (Kraut, 1999), and several studies have compared the data integrity of Web surveys to other methodologies (Hezlett, 2000; Magnan, Lundby, & Fenlason, 2000; Miller, 2001; Spera, 2000; Yost & Homer, 1998). Miller reported a greater tendency for respondents to use scale end points in telephone surveys than in Web surveys. Specifically, there appears to be a general trend toward higher acquiescence when using the telephone compared to either mail or Web surveys (Kraut, 1999; Miller, 2001). Miller believes that the acquiescence associated with telephone interviews may be a function of interviewer bias, whereas respondents taking Web surveys may have a greater feeling of anonymity and feel more comfortable providing responses that are more reflective of their true feelings and less influenced by a desire to please the interviewer.

While there is no apparent difference in overall response patterns when comparing pencil-and-paper (P&P) surveys to Web surveys, there are a few subtle differences that seem to be emerging. When comparing overall mean responses to a question across Web and P&P, the results are mixed: some studies find tendencies toward slightly higher means using the Web, while others find the opposite. The same holds true for missing data. Fenlason (2000) and Yost and Homer (1998) reported higher amounts of missing data (incomplete surveys) on the Web, while our experience indicates very low levels of missing data. Spera (2000) reported that response rates were lower for the Web than for P&P. However, there was a consistent finding that open-ended responses were longer and richer (i.e., more detailed and specific) when they were provided on the Web than when they were provided on P&P surveys (Fenlason, 2000; Yost & Homer, 1998).

COMPARISON OF WEB SURVEYS TO TRADITIONAL MAIL SURVEYS

In an effort to determine whether the medium affects response patterns, a study was conducted comparing responses from traditional mail surveys to responses from a Web survey. Respondents were current customers who had provided both valid mail and email addresses. The sample was randomly divided into two groups: an email recruit to a Web site and a traditional mail survey. The surveys were identical except that one group received a P&P mail survey and the other group received an email recruiting them to the survey site. The recruiting letter was identical in both situations except when referring to the medium. There were no significant differences in demographics between the two recruiting methods for those who completed the survey.
Additionally, the overall response patterns were similar for both samples. However, as can be seen in Figure 1, there was a general tendency for the Web sample to have a slightly lower mean for each question, although only 3 of the 47 questions were significantly different.

Figure 1: A Comparison of Web and Mail Response Patterns (mean satisfaction on a 5-point scale, 5 = Very Satisfied, for each of the 47 questions, mail-back versus e-mail samples)

SAWTOOTH INTERNET SOFTWARE (CIW)

Sawtooth Software developed their adaptive conjoint analysis (ACA) module for the Web in 1999, in response to a request from POPULUS. The ACA module was based on earlier Internet software developed by Sawtooth that could accommodate a simple questionnaire on the Web. We were pre-beta and beta testers of the ACA Web interviewing software prior to the final release. The Sawtooth Web interviewing software continues to evolve, and many improvements have been added since its initial release. POPULUS continues to utilize this software for all of its Web data collection needs.

Sawtooth Software Strengths

In our opinion, the biggest strength of CiW is that it enables Web ACA. Web ACA has many advantages over DOS ACA (Windows ACA is being released soon). Specifically, Web ACA:
• Allows for the inclusion of more than five levels for a given attribute
• Accommodates longer text labels, thus allowing more detail
• Accommodates typographical devices such as italics, bolding, and underlining at the text level
• Accommodates many languages, including Japanese
• Allows respondents to view their attribute importance scores on a single page at the completion of the survey.

Another strength of CiW is compatibility with most older Web browsers. This is important because many respondents do not have computers that support current Web browsers, which would otherwise eliminate them from any study. Other strengths of CiW include:
• The ability to randomize answer choices within a question, or to randomize questions within a block of questions on a page
• An automatic progress bar indicating real-time progress to the respondent
• Password protection allowing only those people who are invited to complete the survey
• Stop-and-restart capability
• The ability to incorporate simple skip patterns
• A "next page" button above the footer, which does not force respondents to scroll to the very bottom of the page before advancing
• The addition of a free-form question that allows programmers greater flexibility in customizing the instrument

Sawtooth Software Weaknesses

As with virtually any product, there are opportunities for improving CiW. In our opinion, the biggest weakness in the current release is the inability to build lists. Some list building is possible, but it has to be accomplished by programming many "if statements," which becomes extremely cumbersome and requires many pages and questions. Another area that could be improved is skip patterns: complex skip patterns are difficult to program and require the creation of many pages to accomplish the goal. A general weakness of Web interviewing software is the click-and-scroll navigation required to move through the questionnaire. Perhaps this is more a function of our age and our desire to use keystrokes rather than a mouse to move through a survey and make responses.
In the current version, there is no feature to assist in managing quotas or to enforce quota sampling in real time. The only way to manage quotas is to constantly monitor the data collection and then reprogram the questionnaire to terminate interviews that meet the quotas.

A couple of weaknesses that are probably more specific to our needs have to do with obtaining real-time results. We often have multiple samples taking the same survey, and each participating company requires regular updates regarding response rates. It is not possible for us to obtain this information without first downloading the data and processing it. And finally, data downloads can be very time-consuming, mainly because we often have files with thousands or tens of thousands of respondents. In most instances, data download times will not be a problem because of smaller numbers of interviews.

CONCLUSION

The Web is a constantly changing world. Software and Web browsers are updated regularly, and it is important to keep abreast of these changes if one desires to conduct Web interviews. Internet penetration continues to grow at a rapid pace. According to a recent poll reported in the Wall Street Journal, approximately 50% of households now have Internet access. This is roughly a 50% increase in penetration over the past three years. People are becoming more computer literate as well. With higher penetration of personal computer usage and an increase in Web activity, it is becoming easier for the researcher to obtain valid information through the use of computers and the Web.

In our opinion, the greatest challenge in the Web interviewing world is finding respondents. Panels appear to be effective for most consumer research; however, for targeted business-to-business needs it may be somewhat difficult to conveniently gain access to the desired respondents. We predict that Web interviews will continue to gain share of interviews. The speed and ease with which information can be obtained make the Web an ideal medium for many data collection needs.

REFERENCES

Fenlason, K. J. (2000, April). Multiple data collection methods in 360-feedback programs: Implications for use and interpretation. Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Hezlett, S. A. (2000, April). Employee attitude surveys in multinational organizations: An investigation of measurement equivalence. Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
King, C. V., & Delana, P. (2001). Web data collection: Reminders and their effects on response rates. Unpublished manuscript.
Kraut, A. I. (1999, April). Want favorable replies? Just call! Telephone versus self-administered surveys. Paper presented at the 14th annual conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Magnan, S. M., Lundby, K. M., & Fenlason, K. J. (2000, April). Dual media: The art and science of paper and internet employee survey implementation. Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Miller, T. W. (2001). Can we trust the data of online research? Marketing Research: A Magazine of Management and Application, 13(2).
Spera, S. D. (2000, April). Transitioning to Web survey methods: Lessons from a cautious adapter.
Paper presented at the 15th annual conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Yost, P. R., & Homer, L. E. (1998, April). Electronic versus paper surveys: Does the medium affect the response? Paper presented at the 13th annual conference of the Society for Industrial and Organizational Psychology, Dallas, TX.

USING CONJOINT ANALYSIS IN ARMY RECRUITING

Todd M. Henry, United States Military Academy
Claudia G. Beach, United States Army Recruiting Command

ABSTRACT

The Army's past efforts to structure recruiting incentives ignored its prime market's preferences. This study extends conjoint utilities by converting them to probabilities for use as the primary input to a goal program. The final product determines incentive levels for Army career field and term-of-service combinations, and the total incentive budget requirements for a fiscal year.

INTRODUCTION

In recent years, the Armed Forces of the United States, particularly the Army, Navy, and Air Force, have been faced with the increasingly difficult task of attracting and recruiting the required number of enlistees. These difficulties become more significant when the U.S. economy is strong, which reduces the number of individuals who might enlist. The fact that the United States has had an extremely strong economy in recent years has played a major role in reducing the number of 17 to 22 year-olds who would consider enlisting in the Armed Forces. The 17 to 22 year-olds comprise the prime market segment in the U.S. for recruitment into entry-level occupational specialties. Three factors have made recruiting more difficult for the Army:
• An extremely low unemployment rate among the prime market segment
• A decrease in the propensity to serve, as tracked by the Youth Attitude Tracking Survey (YATS)
• An increase in the number of young people attending 2-year and 4-year colleges.

As a result, the Army and the U.S. Army Recruiting Command (USAREC) are faced with offering enlistment incentives to entice those who would not otherwise serve in the Army to enlist. The problem is which incentives to offer, when, and to which occupational specialties. These questions are addressed during the Enlisted Incentive Review Board (EIRB). The current method of assigning enlistment incentives does not consider recruit preferences for incentives, and thus it can neither predict the number of enlistments for a given incentive nor evaluate the effects of new incentives. The EIRB requires a quantitative decision support tool that will accurately predict incentive effects and calculate the total cost of offering these incentives. This paper describes the methodology used to create such a decision support tool, known as the Enlisted Bonus Distribution Model.

EFFECTIVE NEEDS ANALYSIS

The Enlisted Incentive Review Board requires a flexible decision support tool to do the following:
• Predict the number of individuals who will enlist in a given occupational specialty for a given incentive and term of service
• Determine the optimal mix of incentives to offer, and to which occupational specialties
• Determine the total cost of offering these incentives
• Minimize the deviation from the recruiting goals for each occupational specialty

Predicting the number of individuals who will enlist for a certain incentive package requires data on recruit preferences.
A choice-based conjoint (CBC) analysis will provide this preference data. A Microsoft Excel®-based integer goal program model has the characteristics required to solve for the optimal mix of incentives to offer. The model is integer-based because the decision is whether or not to offer a certain incentive to a given occupational specialty for a certain term of service. In addition, a goal program realistically models the incentive environment since every occupational specialty has an annual recruitment goal. For these reasons a binary integer goal program model was selected as the best alternative.

THE CHOICE-BASED CONJOINT STUDY

MarketVision Research® was contracted to complete a market survey of the Army's target population to assess the effectiveness of different enlistment incentives. MarketVision used choice-based conjoint analysis in its assessment. Relevant attributes for the CBC study included career field, term of service and incentives. There are approximately 194 entry-level occupational specialties organized into 26 career fields. Although occupational specialty data is required for the problem, career field data was obtained to reduce the number of choice tasks required for each respondent. Table 1 shows the career fields (military positions) included in the CBC study.

Table 1: Career Fields (Military Positions)
Military Intelligence; Military Police; Psychological Operations; Administration; Aviation Operations; Medical; Transportation; Public Affairs/Journalism; Electronic Warfare/Intercept Systems Maintenance; Automatic Data Processing/Computers; Ammunition; Signal Operations; Supply and Services; Visual Information/Signal; Air Defense Artillery; Infantry; Armor; Combat Engineering; Electronic Maintenance and Calibration; Field Artillery; Topographic Engineering; Aircraft Maintenance; Mechanical Maintenance; Electronic Warfare/Cryptologic Operations; General Engineering/Construction; Petroleum and Water

Terms of service from two to six years were included in the study. Incentives evaluated in the study were the Army College Fund (from $26,500 to $75,000), enlistment bonus (from $1,000 to $24,000) and a combination of both the Army College Fund and the enlistment bonus. Appendix A shows the terms of service and incentive levels tested in the study. A total of 506 intercept interviews were conducted in malls at ten locations throughout the U.S. Respondents were intercepted at random and presented with 20 partial profile choice tasks as shown in Figure 1.

Which of these three enlistment scenarios would you choose?

1. Aircraft Maintenance          2. Transportation               3. Petroleum and Water
   3-year enlistment                5-year enlistment               4-year enlistment
   $8,000 Enlistment Bonus          $8,000 Enlistment Bonus         $1,000 Enlistment Bonus
   and                              and                             and
   $49,000 Army College Fund        $19,626 GI Bill                 $50,000 Army College Fund

4. If these were my only options I would not enlist

Figure 1: Choice Task

Using statistical analysis, MarketVision Research calculated utilities that represent respondents' preferences for each of the attributes tested: career field (military position), term of service, Army College Fund, and enlistment bonus. Table 2 shows the utilities.
Military Position                                    Utility
Military Intelligence                                 1.5446
Military Police                                       1.0058
Psychological Operations                              0.8385
Administration                                        0.4037
Aviation Operations                                   0.3262
Medical                                               0.2913
Transportation                                        0.2699
Public Affairs/Journalism                             0.2505
Electronic Warfare/Intercept Systems Maintenance      0.2144
Automatic Data Processing/Computers                   0.1446
Ammunition                                            0.1358
Signal Operations                                     0.0617
Supply and Services                                   0.0313
Visual Information/Signal                            -0.0535
Air Defense Artillery                                -0.1251
Infantry                                             -0.1379
Armor                                                -0.1410
Combat Engineering                                   -0.1996
Electronic Maintenance and Calibration               -0.2006
Field Artillery                                      -0.2512
Topographic Engineering                              -0.5162
Aircraft Maintenance                                 -0.5431
Mechanical Maintenance                               -0.5907
Electronic Warfare/Cryptologic Operations            -0.7675
General Engineering/Construction                     -0.7881
Petroleum and Water                                  -1.2649

Enlistment Period     Utility
2-year                 0.4320
3-year                 0.3035
4-year                 0.0370
5-year                -0.2434
6-year                -0.5291

Incentive             Utility (per $1,000)
Enlistment bonus       0.0576
Army College Fund      0.0237

Table 2. CBC Utilities

The CBC utilities will be converted to probabilities and serve as a major input to the Enlisted Bonus Distribution Model.

ENLISTED BONUS DISTRIBUTION MODEL

The Enlisted Bonus Distribution Model is a binary integer goal program that minimizes the deviations from the recruiting goals for each military occupational specialty (MOS) while remaining within the recruiting budget. Figure 2 describes the inputs and outputs of the model:

Inputs: probabilities of selection; recruiting goals per MOS; 17-22 year-old population; FY recruiting budget
Model: binary integer goal program
Outputs: incentives to offer each MOS; total cost; incentive costs

Figure 2: Model Description

The probabilities of selection are obtained from the CBC utilities calculated by MarketVision Research®. Each career field, term of service and incentive combination is considered a product. The total product utility, Uijk, is given as the sum of the career field (i) utility, term of service (j) utility and incentive (k) utility:

Uijk = utility(career field i) + utility(term of service j) + utility(incentive k)

This product utility is treated as the log-odds of a positive response to the product and must be converted to a probability of positive response. The probability of a positive response to the product is then given by (Joles et al., An Enlistment Bonus Distribution Model, 1998):

pijk = exp(Uijk) / (1 + exp(Uijk))

The estimated fraction, Pijk, of the 17-22 year-old population that would enlist for a specific career field (i), term of service (j) and incentive (k) is given by (Joles et al., 1998):

Pijk = pijk / ( Σi Σj Σk pijk )

This gives us the fraction of the population who would enlist into a certain career field given a term of service and incentive. The model requires the estimated fraction of the population who would enlist for a certain occupational specialty (m), term of service (j) and incentive (k), Pmjk. This requires that Pijk be divided among all occupational specialties within the career field based on the percentage fill of the occupational specialty within the career field. For instance, MOS 13C has a recruiting goal of 108 out of 3350 for the field artillery career field.
Therefore, Pmjk for 13C is given as:

Pmjk = Pijk × (% of career field) = Pijk × (108 / 3350)

Incentive policy allows for only one incentive level to be offered to each occupational specialty for a certain term of service. For example, occupational specialty 11X may be offered one of the following incentives for a 2-year term of service:

   $2,000 enlistment bonus, or
   $26,500 Army College Fund, or
   $1,000 enlistment bonus plus $26,500 Army College Fund.

Both a $1,000 and a $2,000 enlistment bonus could not be offered to 11X for a 2-year term of service. This affects the fraction-of-the-population calculations above. Because only one incentive type can be offered, we must assume that a higher incentive level will also attract those persons who would enlist for a lower incentive level. For instance, if a $2,000 enlistment bonus is offered to occupational specialty 11X for a 2-year term of service, we will also attract those who would enlist into occupational specialty 11X for a 2-year term of service given a $1,000 enlistment bonus.

The expected number of recruits, Rmjk, to enlist into occupational specialty (m), for term of service (j), given incentive (k) is given by:

Rmjk = Pmjk × (17-22 year-old population)

The model then uses Rmjk to determine the optimal mix of incentives to offer occupational specialty (m) to meet its recruiting goal. The decision variables in the binary integer goal program are which incentives to offer each occupational specialty. The benefits and costs of offering each of the incentives are evaluated and considered globally throughout the entire solution space. That is, the effects of each incentive are evaluated with regard to their impact on the model as a whole.

There are four categories of constraints in the model (besides the binary constraint on the decision variables). The first category is the recruiting goal for each occupational specialty. These are goal constraints: the left-hand side is the summation of all the expected recruits from all the offered incentives, and the right-hand side is the recruiting goal for that individual occupational specialty. The second type of constraint is the budget constraint. Because the budget covers all incentives and all occupational specialties, there is no budget constraint for each individual occupational specialty. The left-hand side of the budget constraint is the summation of the costs of all the incentives offered multiplied by the number of individuals who select those incentives, across all occupational specialties; the right-hand side is the fiscal year recruiting budget. The third category of constraint is that only one level of each incentive (enlistment bonus, Army College Fund, and a combination of both) can be offered to each occupational specialty for a given term of service. For instance, occupational specialty 11X could be offered a $4,000 enlistment bonus for a 3-year term of service; a $50,000 Army College Fund for a 4-year term of service; and a $1,000 enlistment bonus plus a $40,000 Army College Fund for a 5-year term of service. For each occupational specialty there are fifteen of these constraints (one constraint for each type of incentive and each possible term of service). The final type of constraint is on the minimum term of service required for each occupational specialty. Each occupational specialty is assigned a minimum term of service, and no incentives can be offered for terms of service less than the minimum.
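To make the conversion from utilities to expected recruits described above concrete, the short Python sketch below walks through the chain Uijk, pijk, Pijk, Pmjk, Rmjk for one illustrative product. The utility figures are taken from Table 2, but the two-product "market", the MOS fill percentage reuse of the 13C example, the prime-market population figure, and all variable names are illustrative assumptions, not the model's actual inputs.

    import math

    # Illustrative CBC utilities from Table 2 (career field, term of service,
    # and incentive utility per $1,000).
    U_FIELD = {"Field Artillery": -0.2512}
    U_TOS = {"3-year": 0.3035}
    U_EB_PER_1000 = 0.0576   # enlistment bonus
    U_ACF_PER_1000 = 0.0237  # Army College Fund

    def product_utility(field, tos, eb_dollars, acf_dollars):
        """Uijk = career field utility + term of service utility + incentive utility."""
        incentive = U_EB_PER_1000 * eb_dollars / 1000 + U_ACF_PER_1000 * acf_dollars / 1000
        return U_FIELD[field] + U_TOS[tos] + incentive

    def positive_response_probability(u):
        """pijk = exp(Uijk) / (1 + exp(Uijk))."""
        return math.exp(u) / (1.0 + math.exp(u))

    # One product: Field Artillery, 3-year term, $10,000 EB plus $49,000 ACF.
    u = product_utility("Field Artillery", "3-year", 10_000, 49_000)
    p = positive_response_probability(u)

    # Pijk is each product's share of the summed probabilities over all products;
    # here the "market" is shrunk to two products to keep the example small.
    p_other = positive_response_probability(product_utility("Field Artillery", "3-year", 0, 0))
    P = p / (p + p_other)

    # Split the career-field fraction down to one MOS by its share of the career-field
    # goal, e.g. MOS 13C holds 108 of the 3,350 field artillery seats.
    P_mos = P * (108 / 3350)

    # Expected recruits for this MOS/term/incentive; the population figure below is
    # a placeholder, not the study's 17-22 year-old population input.
    PRIME_MARKET = 1_000_000
    R_mos = P_mos * PRIME_MARKET
    print(f"Uijk={u:.3f}  pijk={p:.3f}  Pijk={P:.3f}  expected recruits = {R_mos:,.0f}")

In the full model this calculation is repeated for every occupational specialty, term of service and incentive combination, and the resulting Rmjk values feed the goal program's constraints and objective function discussed next.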
The objective function for the goal program minimizes the deviations from the recruiting goals for each occupational specialty. The objective function follows:

MINIMIZE  Σ(MOSs) Σ(TOS) Σ(Incentives) [ wU × underMOS + wO × overMOS ]

wU represents the weight assigned when an occupational specialty attracts fewer recruits than (under) its recruiting goal, and wO represents the weight assigned when it attracts more than (over) its goal. These weights allow the user to identify critical occupational specialties. underMOS and overMOS are the under- and over-achievement deviation variables for each of the occupational specialties included in the model. This objective function allows solutions that may include local overages or shortages in order to find a global solution that minimizes the deviations over all occupational specialties.

MODEL RESULTS

The model contains over 17,500 decision variables and over 13,600 constraints. This size requires the use of an Extended Large-Scale Solver Engine from Frontline Systems, Inc. Table 3 shows a small portion of the model results.

MOS     Target (Acc. seats)   # Recruits, No Incentives   Offer Incentives?   # Recruits Expected   Total Cost ($)
00B1        48                    24                      Yes                     48                   354,744
11X1     8,534                 4,267                      Yes                  6,153                18,234,412
12B1       288                   144                      Yes                    288                   783,080
12C1        80                    40                      Yes                     80                   191,426
13B1     1,513                   756                      Yes                  1,513                 7,015,586
13C1       108                    54                      Yes                    108                   497,587
13D1       162                    81                      Yes                    162                   746,630
13E1       289                   144                      Yes                    289                 1,337,601
13F1       393                   196                      Yes                    393                 1,823,214
13M1       336                   168                      Yes                    336                 1,548,851

Table 3: Model Results

For MOS 13B1, with a goal (target) of 1,513, the model has determined the optimal mix of incentives to offer to achieve this goal. The cost of enlisting 1,513 recruits into 13B1 is $7,015,586. Table 4 below shows the incentive report generated by the model. To meet the recruiting goal for 13B1 we should offer a $10,000 enlistment bonus for a 3-year term of service, a $49,000 Army College Fund for a 3-year term of service, and a $4,000 enlistment bonus plus a $33,000 Army College Fund for a 3-year term of service.

Total Budget Required: $100,000,000
FY01 Recruiting Budget: $100,000,000
Total # Recruits: 30,708

MOS     Target   Min TOS   # Recruits Expected   Total Cost ($)   Incentives to Offer
13B1    1,513    3         1,513                 7,015,586        3yr10EB, 3yr49ACF, 3yr4EB33ACF
13C1      108    3           108                   497,587        3yr10EB, 3yr49ACF
13D1      162    3           162                   746,630        3yr10EB, 3yr49ACF
13E1      289    3           289                 1,337,601        3yr10EB, 3yr49ACF
13F1      393    3           393                 1,823,214        3yr10EB, 3yr49ACF
13M1      336    3           336                 1,548,851        3yr10EB, 3yr49ACF
13P1      338    3           338                 1,559,387        3yr10EB, 3yr49ACF
13R1       86    3            86                   395,114        3yr10EB, 3yr49ACF

Table 4: Incentive Report
(In addition to 13B1, the 3yr4EB33ACF combination incentive is also offered to three of the other specialties listed.)

RESULTS

Model verification involved comparing the model's predicted recruits into an occupational specialty for a specific incentive against fiscal year recruiting data for 2000. Figure 3 shows the results for 11X.

[Figure 3: Verification Results. Actual vs. predicted number of recruits for 11X across incentive and term-of-service combinations; p-value = .363]

Figure 3 shows that the model does a good job of estimating recruits at the shorter terms of service, but over-estimates recruit preference for longer terms of service. Performing a simple analysis of variance for the actual versus predicted recruit data over all occupational specialties and incentives results in a p-value of .363.
This p-value seems to indicate that the data may be statistically similar, although not strongly. This may be caused by study subjects being willing to accept longer periods of commitment than if they were actually signing a contract. Recruiting policies may also affect the results. Not all incentives are offered throughout the entire year. Given the data for 11X in table 3, $10,000 enlistment bonus for a 5-year term of service may have only been offered for a brief period of time. This would effectively reduce the 17-22 yearold population who were aware of this incentive. Also, not every recruit that enters a recruiting station is able to enlist into every occupational specialty. A battery of tests determines his enlistment choices and only the incentives offered to that small number of occupational specialties are made known to the recruit. So not all “products” are available and known by all recruits, which is contrary to the conjoint study assumption that all products are offered to everyone. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 43 CONCLUSION Choice-based conjoint study utilities adequately portray prime market preferences. Current recruiting policies somewhat diminish the validity of these utilities. These utilities can easily be converted to fractions of the population who would enlist for a certain recruiting product. This fraction of the population can then be used in a binary integer goal program, Enlisted Bonus Distribution Model, to determine the optimal mix of incentives to offer each occupational specialty to ensure it meets its recruiting goal. Future research should center on modeling changes in the U.S. economy on recruitment. Specifically, the CBC study utilities will change as the U.S. economy changes. Finding a relationship between economic factors and CBC utilities would allow the model data to change as economic factors change. REFERENCES Army College Fund Cost-effectiveness Study. Systems Research and Applications Corporation and Economic Research Laboratory, Inc., November 1990. Asch, Beth and Dertouzos, James. Educational Benefits Versus Enlistment Bonuses: A Comparison of Recruiting Options. Rand Corporation, 1994. Asch, Beth et al. Recent Trends and Their Implications: Preliminary Analysis. Rand Corporation, MR-549-A/OSD. 1994. Clark, Captain Charles G. Jr. et al. The Impact of Desert Shield/Desert Storm and Force Reductions on Army Recruiting and Retention. Department of Systems Engineering, United States Military Academy, May 1991. Curry-White, Brenda et al. U.S. Army Incentives Choice-based Conjoint Study. University of Louisville Urban Studies Institute, TCN 96-162, April 1997. Dolan, Robert J. Conjoint Analysis: A Manager’s Guide. President & Fellows of Harvard College, Harvard Business School, 1990. Enlisted Bonus Distribution Conjoint Study. MarketVision Research, 2000. Joles, Major Jeffery et al. An Enlisted Bonus Distribution Model. Department of Systems Engineering, United States Military Academy, February 1998. Perspectives of Surveyed Service Members in Retention Critical Specialties. United States General Accounting Office, GAO/NSDIA-99-197BR, August 1999. Pinnell, Jon. Conjoint Analysis: An Introduction. MarketVision Research, 1997. Solver User’s Guide. Frontline Systems, Inc., 1996. 44 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 
Appendix A Terms of Service and Incentive Levels Tested 2-Year Enlistment Incentives No Enlistment Bonus $1,000 $3,000 $5,000 $7,000 $9,000 $26,500 Army College Fund $39,000 Army College Fund $1,000 and $26,500 Army College Fund $4,000 and $26,500 Army College Fund $2,000 and $39,000 Army College Fund $8,000 and $39,000 Army College Fund 3-Year Enlistment Incentives No Enlistment Bonus $1,000 $2,000 $4,000 $6,000 $8,000 $10,000 $33,000 Army College Fund $49,000 Army College Fund $1,000 and $33,000 Army College Fund $4,000 and $33,000 Army College Fund $2,000 and $49,000 Army College Fund $8,000 and $49,000 Army College Fund 4 and 5-Year Enlistment Incentives No Enlistment Bonus $2,000 $4,000 $8,000 $12,000 $16,000 $20,000 $40,000 Army College Fund $50,000 Army College Fund $60,000 Army College Fund $75,000 Army College Fund $1,000 and $40,000 Army College Fund $4,000 and $40,000 Army College Fund $2,000 and $60,000 Army College Fund $8,000 and $60,000 Army College Fund $1,000 and $50,000 Army College Fund $4,000 and $50,000 Army College Fund $2,000 and $75,000 Army College Fund $8,000 and $75,000 Army College Fund 6-Year Enlistment Incentives No Enlistment Bonus $2,000 $4,000 $8,000 $12,000 $18,000 $24,000 $40,000 Army College Fund $50,000 Army College Fund $60,000 Army College Fund $75,000 Army College Fund $2,000 and $40,000 Army College Fund $2,000 and $50,000 Army College Fund $8,000 and $40,000 Army College Fund $8,000 and $50,000 Army College Fund $4,000 and $60,000 Army College Fund $4,000 and $75,000 Army College Fund $12,000 and $60,000 Army College Fund $12,000 and $75,000 Army College Fund 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 45 46 2001 Sawtooth Software Conference Proceedings: Sequim, WA. DEFENDING DOMINANT SHARE: USING MARKET SEGMENTATION AND CUSTOMER RETENTION MODELING TO MAINTAIN MARKET LEADERSHIP Michael G. Mulhern, Ph.D. Mulhern Consulting ABSTRACT Regardless of their market’s growth rate or competitive intensity, market leaders expend considerable resources defending and improving their market position. This paper presents a case study from the wireless communications industry that incorporates two widely used marketing strategy elements – market segmentation and customer retention. After using customer information files to derive behaviorally based customer segments, qualitative and quantitative research was conducted to gather performance and switching propensity data. After an extensive pre-test, scale reliability was evaluated with coefficient Alpha. Once survey data were collected, scale items were tested for convergent and discriminant validity via correlation analysis. Finally, retention models using logistic regression were developed for each segment. Recommendations were made to maintain and enhance the firm’s market leadership position. Specifically, we identified the strategy elements that would likely retain customers in high revenue, heavy usage segments and lower usage segments management wanted to grow. BACKGROUND This paper focuses on a portion of the segmentation and retention process undertaken by a North American provider of wireless telecommunication services. The need for the study arose when my client, the dominant player with 60+% market share, recognized two major structural changes taking place. First, its major competitor had recently allied itself with a major international telecom provider having virtually unlimited resources. 
Secondly, two new competitors had entered the market in the past year with the potential for more new entrants to follow. Consequently, senior management decided to segment their customer base using their customer information file of transaction data. Once segmented, primary research would offer ways to retain their existing customers. The segmentation and retention process encompassed several major stages. When contacted by the client, we learned that behavioral segmentation had already been implemented. Eleven segments had been derived from the customer information file using factor analysis and CHAID. Management identified the most relevant variable on each factor and these variables were submitted to a CHAID analysis. CHAID is a large sample exploratory 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 47 segmentation procedure. Bases for segmentation included revenue and usage variables. In this case, 11 segments were derived, four high usage and seven light usage. MAKING MODELING STRATEGIC: SETTING STUDY OBJECTIVES When Mulhern Consulting was contacted to bid on this project, management sought strategic and operation guidance on the drivers of satisfaction in each customer segment. We recommended that the focus of the effort should determine how performance affected profitability. A compromise was reached where the determinants of satisfaction, dissatisfaction, and retention were investigated. This paper will focus on the retention modeling. RETENTION MODELING Data Sources Qualitative Research Once the segments were identified, primary research was undertaken. During this stage, qualitative research preceded quantitative research. The qualitative work consisted of 22 focus groups, two with each segment. The primary purpose of the groups was to help identify the attributes employed by customers to select and assess the performance of a wireless service provider. Quantitative Research Once the attribute list was created and the survey developed, an extensive pre-test was conducted. In addition to clarifying question wording and flow, the pre-test results were used to test the reliability of the scale items. Cronbach’s alpha was the reliability test applied. Reliability scores were quite high so no modifications to the scale items were made. Reliability is a necessary component of properly constructing an attitude scale but it alone is not sufficient. To meet the sufficiency criterion, validity testing must also be performed. Consequently, once fielding was completed, construct validity was investigated. Both convergent and discriminant validity, two key components of construct validity, were assessed using Pearson’s correlation. For all pairs of attributes, both convergent and discriminant validity were statistically significant. Structuring the Modeling Problem Dependent Variable: Retention The dependent variable, retention, was conceptualized as a constant sum measure (Green, 1997). Retention was measured with a constant sum scale that asked respondents to assume that their contract was about to expire and then allocate ten points across the four competitors based upon the odds (or likelihood) of selecting each provider. Since these data were collected via 48 2001 Sawtooth Software Conference Proceedings: Sequim, WA. survey research, retention is actually a repurchase intention score rather than behaviorally based retention. 
Independent Variables: Retention The independent variables were performance compared to expectations on 50 scale items in five business areas: sales, network, service, pricing, and billing. Both the business area definition and the performance attributes evolved from the qualitative research. Two approaches were taken to identifying independent variables to include in the models. First, a factor analysis was run to assess the correlation among variables. Since each factor is orthogonal, the highest loading variable on each factor was considered a candidate independent variable. The second approach implements an idea proposed by Allenby (1995). He proposed that respondents at the extremes of a distribution were more valuable when developing marketing strategy than the bulk of individuals at the center of the distribution. We modified the idea and applied it to attributes in the survey. The idea was that the attributes that had “extreme” scores (i.e. the highest and lowest within a business area) might be significant predictors of retention. Essentially, we developed a set of independent variables where the competitors scored highest (i.e. best performing attributes) and the attributes where the competitors scored lowest (i.e. worst performing attributes) within each business area. Models employing each set of independent variables were constructed separately. MODELING APPROACHES CONSIDERED Three modeling approaches were considered for this study. The first, ratings based conjoint, was eliminated primarily due to the difficulty of determining objective levels for the attributes of interest. Further, since modeling by segment was required, many designs would have to have been developed at a cost that was prohibitive to the client. Finally, many attributes made ratings based conjoint a less than optimal technique. Structural equation modeling was also evaluated. Given the time and budget constraints of the project, we doubted that acceptable models by segment could be derived since management needed operational as well as strategic guidance. A regression-based approach was selected because it could handle many attributes or independent variables while maintaining individual level data, no level specification was required, and it could provide both strategic and operational guidance to management. Modeling Goals Once a regression-based approach was selected, three modeling goals were identified. First, we wanted to create a retention variable with sufficient variation in response across individuals. Since the goal of regression is to assess the degree of variation in a dependent variable explained 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 49 by one or more independent variables, more rather than less variation in the dependent variable was preferred. This required the repurchase intent question to be recoded. Secondly, we wanted to develop valid and reliable indicators of the constructs measured. To accomplish this goal, construct validity was assessed. The final goal was to build models that explained as well as predicted repurchase intent. Investigating logistic regression diagnostics and modifying the models based on what we learned helped achieve this goal. Each of these goals required actions to be taken with regard to both the data and its analysis. 
Goal 1: Ensuring Variation in the Dependent Variable Initial assessment of the frequency distribution for the repurchase intention variable indicated that 48% of the respondents allocated all 10 points to their current provider. In retrospect, this is not surprising since the sample consisted only of the client’s current customers. However, to ensure variation, the repurchase intention scores were recoded into these categories: Likely to repurchase=10, Unlikely to repurchase=0-6, Missing=7-9. After recoding, 52% of the remaining respondents fell into the likely to repurchase category and 48% were categorized as unlikely to repurchase. Recoding impacted the modeling in two ways. First, it ensured that sufficient variation existed, and secondly, required that logistic regression be used since the dependent variable was now dichotomous. Goal 2: Developing Valid and Reliable Attitude Scales with Construct Validity Psychometric Theory When attempting to measure attitudinal constructs, psychometric theory suggests that constructs must be both valid and reliable. Construct validity has been defined in various ways, including: • The extent to which an operationalization measures the concept it purports to measure (Zaltman et al 1973), and • Discriminant and convergent validity (Campbell and Fiske, 1959) Bagozzi (1980) expands these definitions to include theoretical and observational meaningfulness of concepts, internal consistency of operationalizations and nomological validity. He also explores several more extensive methodologies for testing construct validity: multitraitmultimethod and causal modeling. In this paper, we will focus on reliability (internal consistency of operationalizations), convergent validity, and discriminant validity. Reliability Two approaches to internal consistency of interval level, cross sectional data are widely used in marketing research — split half reliability and Cronbach's alpha. Because the manner in which separate measures of the same construct can be split is arbitrary and the value of the reliability coefficient is contingent upon this division, pointed criticism has been levied against the split half method. A widely accepted alternative is Cronbach's alpha. Alpha overcomes the arbitrary nature of the splitting decision by estimating the mean reliability coefficient for all possible ways 50 2001 Sawtooth Software Conference Proceedings: Sequim, WA. of splitting a set of items in half. Alpha estimates reliability based upon the observed correlations or covariances of the scale items with each other. Alpha indicates how much correlation we can expect between our scale and all other scales that could be used to measure the same underlying construct. Consequently, higher values indicate greater reliability. In the early stages of exploratory research, Churchill (1979) notes that alpha values of .5 -.6 are acceptable. In more confirmatory settings, scores should be higher. A general rule of thumb suggests that values of .70 and above are considered acceptable. It should be noted, however, that alpha assumes equal units of measurement in each scale item and no measurement error (Bagozzi, 1980). Table 1: Reliability results for the 77 pre-test respondents Measurement Scales by Business Area Alpha Score Sales Process Cellular Network Customer Service Pricing Billing .94 .88 .92 .89 .86 Since the reliability scores were uniformly high, no modifications were made to the attributes. 
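As a rough illustration of the two steps just described, the Python sketch below recodes a 0-10 constant-sum allocation to the current provider into the likely/unlikely/missing categories and computes Cronbach's alpha for one attitude scale. The data, column names and sample size are hypothetical, and the alpha function is a generic textbook implementation rather than the study's software.

    import numpy as np
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of total score)."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def recode_repurchase(points_to_current: pd.Series) -> pd.Series:
        """Recode the 0-10 allocation to the current provider:
        10 -> likely (1), 0-6 -> unlikely (0), 7-9 -> missing."""
        return points_to_current.map(lambda x: 1 if x == 10 else (0 if x <= 6 else np.nan))

    # Hypothetical pre-test data: five billing items rated by 77 respondents, plus a
    # constant-sum repurchase allocation (points given to the current provider).
    rng = np.random.default_rng(1)
    billing = pd.DataFrame(rng.integers(1, 8, size=(77, 5)),
                           columns=[f"billing{i}" for i in range(1, 6)])
    allocation = pd.Series(rng.integers(0, 11, size=77))

    print("alpha (billing scale):", round(cronbach_alpha(billing), 2))
    print(recode_repurchase(allocation).value_counts(dropna=False))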
Convergent and Discriminant Validity By using inter-item correlation analysis, we can determine if the attributes designed to measure a single construct are highly correlated (i.e. converge) while having a low correlation with attributes that purportedly measure a different concept (i.e. discriminate). For example, the attributes or scale items that are developed to measure customer service attributes should have high correlations with each other, while these same scale items should have low correlations with attributes designed to measure other constructs (e.g. billing, pricing, sales). 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 51 Tables 2 & 3: Selected Results for Convergent and Discriminant Validity. Convergent Validity Network1 Network2 Network3 Network4 Network1 1.00 Network2 .41 1.00 Network3 .51 .53 1.00 Network4 .48 .52 .55 1.00 Discriminant Validity Network1 Network2 Network3 Network4 Billing1 .23 .27 .27 .26 Billing2 .22 .26 .27 .26 Billing3 .24 .24 .27 .26 Billing4 .22 .19 .20 .21 Note: All results significant at P=.00, and all pairs were tested and the results were similar to those in the above tables. Goal 3: Improving Explanation and Prediction Evaluating Model Quality: Overall Model Fit A set of goodness of fit statistics was used to evaluate model quality. Pearson’s Chi Square was employed to determine how well the fitted values represented the observed values. Secondly, R SquareL assesses the degree to which the inclusion of a set of independent variables reduces badness of fit by calculating the proportionate reduction in log likelihood. With respect to prediction, Lambdap, a measure of the proportionate change in error, was employed. Also, the classification or confusion matrix was used to determine the proportion of cases correctly classified by the model. Following Menard (1995), the following decision rules guided our model assessment: 52 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Table 4: “Goodness of Fit” and Predictive Ability Statistical Test Decision Rule “Goodness of Fit" Chi Square R SquareL P > .05 High - .3 or higher Moderate - .2 - .3 Low – Less than .2 Predictive Ability Lambdap High - .3 or higher Moderate - .2 - .3 Low – Less than .2 IMPROVING MODEL QUALITY: LOGISTIC REGRESSION DIAGNOSTICS Since modeling is an iterative process, logistic regression diagnostics were used to obtain an initial assessment of model quality. The purpose was to identify those cases where the model worked poorly as well as cases that had a great deal of influence on the model’s parameter estimates. Following Menard’s recommendations (pp77-79), the Studentized residual was used to identify those instances where the model worked poorly. The Studentized residual estimates the change in deviance if a case is excluded. Deviance is the contribution of each case to poorness of fit. With respect to identifying cases that had a large influence on the model’s parameter estimates, two statistics were evaluated; leverage and DfBeta. Leverage assesses the impact of an observed Y on a predicted Y, and DfBeta measures the change in the logistic regression coefficient when a case is deleted. For several of the segments, these diagnostics identified cases which had undue influence on the model or cases for which the model did not fit very well. In general, these cases were few in number but, when deleted from the database, had a major impact on the explanatory and predictive capability of the segment retention models. 
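The sketch below shows one way the fit, prediction, and case-level quantities discussed in this section can be computed in Python with statsmodels; the data are simulated and all variable names are assumptions. Standardized Pearson residuals stand in here for the Studentized (deviance-based) residuals used in the study, and the flagging cutoffs are common rules of thumb rather than the authors' values, so treat this as a sketch of the general workflow rather than a reproduction of it.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical segment data: y is the recoded repurchase indicator (1 = likely,
    # 0 = unlikely); X holds candidate performance attributes plus a constant.
    rng = np.random.default_rng(2)
    X = sm.add_constant(rng.normal(size=(99, 3)))
    y = (rng.random(99) < 1 / (1 + np.exp(-(X @ np.array([0.2, 0.9, -0.6, 0.4]))))).astype(int)

    fit = sm.Logit(y, X).fit(disp=False)

    # Goodness of fit: R-SquareL is the proportionate reduction in log-likelihood,
    # 1 - LL(model)/LL(null); statsmodels reports the same value as its pseudo R-squared.
    r2_L = 1 - fit.llf / fit.llnull

    # Predictive ability: classification at 0.5 and Lambda-p, the proportionate
    # reduction in classification error versus always predicting the modal category.
    pred = (fit.predict(X) >= 0.5).astype(int)
    errors_model = np.sum(pred != y)
    errors_null = min(y.sum(), len(y) - y.sum())
    lambda_p = 1 - errors_model / errors_null
    pct_correct = np.mean(pred == y)

    # Case-level diagnostics: leverage from the weighted hat matrix and standardized
    # Pearson residuals, used to flag poorly fit or influential cases.
    p = fit.predict(X)
    w = p * (1 - p)
    W_sqrt = np.sqrt(w)[:, None]
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    H = (W_sqrt * X) @ XtWX_inv @ (W_sqrt * X).T
    leverage = np.diag(H)
    std_pearson = (y - p) / np.sqrt(w * (1 - leverage))

    flagged = np.where((leverage > 2 * X.shape[1] / len(y)) | (np.abs(std_pearson) > 2))[0]
    print(f"R2_L={r2_L:.2f}  lambda_p={lambda_p:.2f}  correct={pct_correct:.0%}  flagged={flagged}")

Refitting after reviewing (and possibly deleting) the flagged cases is the iterative loop described in the next subsection.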
Although a variety of other measures were undertaken to improve model quality, these diagnostics were the most effective in this study. The tables below illustrate this phenomenon for Segment H2.

Table 5: Impact of Diagnostics on Statistical Measures

                                    With Influential Cases   Without Influential Cases
Sample Size                         99                       96
Goodness of Fit
  Model Chi Square                  32.0                     44.3
  R SquareL                         0.24                     0.35
Predictive Ability
  Lambdap                           0.39                     0.56
  Percent Correctly Classified      76%                      83%

Table 6: Impact of Diagnostics on Independent Variables (Selected disguised results)

Segment   Indep. Variable Set   Business Area   Odds Ratio
L1        Best                  Pricing         2.3
L1        Worst                 Pricing         1.7
L1        Worst                 Sales           1.7
H1        Factor                Sales           2.0
H1        Factor                Pricing         1.6
H4        Best                  Network         4.0
H4        Best                  Billing         3.1

Variables Included in Model   With Influential Cases   Without Influential Cases
X1                            Significant @ 0.01       Significant @ 0.00
X2                            Significant @ 0.08       Significant @ 0.02
X3                            Not Significant          Significant @ 0.00

Note: All reported odds ratios are statistically significant at the .05 level. The odds ratio indicates the impact of a one-unit change in the independent variable on retention.

LESSONS LEARNED: RESEARCH

Modeling Flexibility
Logistic regression offers modeling flexibility by allowing for dependent variable modification. In this study the measurement scale of the dependent variable was changed from interval to nominal.

Scale Validation
Construct validity provides confidence that we are actually measuring what we think we are measuring.

Advantages of Using Regression Diagnostics
Diagnostics can improve the statistical indicators of model quality dramatically, especially where certain cases fit the model poorly or where there are influential cases. Further, diagnostics can suggest substantive changes to the model.

LESSONS LEARNED: MANAGEMENT

Model Retention, Not Satisfaction
The retention models statistically outperformed the satisfaction and dissatisfaction models in the majority of segments. Therefore, in this study, it was more appropriate to model retention.

Segmentation Basis Variables: Revenue vs. Profit
Revenue was selected by management as the initial basis variable for behavioral segmentation. However, simply because a customer generates substantial revenue does not necessarily imply s/he is highly profitable. A customer profitability score or index may be a more appropriate variable upon which to base the behavioral segmentation.

Evaluate Behavioral Segmentation Solution Based on Criteria for Effective Segmentation
Several criteria for effective segmentation were violated in this study.

Substantial: Both qualitative and quantitative primary research suggested the segments were not as homogeneous as management expected. Although models were built to identify drivers in each low-usage segment, management chose not to develop strategies for several low-usage segments due to their small size.

Actionable: In addition, some of the retention drivers cut across segments so that improvements could not be segment-specific. This violates the actionability criterion for segmentation. As an example, dropped calls were particularly relevant for several high-usage segments. However, solving this problem requires a system-wide network upgrade. Any upgrade affects all users, not only those whose retention is driven by this attribute. This reinforces the need to match the criteria for effective segmentation with the behavioral segmentation solution.
Had this been accomplished prior to the retention modeling, resources would have been allocated more efficiently. Optimizing Segmentation: Combining Behavioral with Attitudinal Segmentation Combining behavioral with attitudinal segmentation may have enhanced the segmentation scheme by identifying the attitudinal rationale for the behavior. Management decided not to pursue this alternative. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 55 RETROSPECTIVE If asked to replicate this study today, the following should be considered: Barriers to Switching Users with no ability to switch (i.e. individuals locked into their employer’s service plan) were excluded from the sample. However, for those included in the respondent database, barriers to switching may have impacted propensities to switch. These barriers could be modeled as independent variables. Response Bias in Satisfaction and Retention Research After this research was completed, Mittal and Kamakura (2000) found that repurchase intentions could be impacted by response bias in the form of demographic characteristics. When predicting actual behavior among automobile purchasers, they found the interaction between satisfaction and demographics captured the response bias among respondents. Given the time and budget, modeling within each segment could be tested for these covariates. Investigate Linkages Among Satisfaction, Retention, and Profitability Since the assumption is that increased satisfaction will lead to enhanced retention and greater profitability, attention needs to be paid to empirically testing these assumptions. Kamakura et al (2001) have begun working on this. Consider Hierarchical Bayes as a Modeling Tool Hierarchical Bayes analysis can estimate the parameters of a randomized coefficients regression model. Early indications are that HB could provide superior estimates of coefficients for each segment, by assuming that the segments are all drawn from an underlying population distribution. Since there was only one observation per respondent, it would be necessary to pool individuals within segment. However, HB is capable of estimating both the variance within segment, including that due to unrecognized heterogeneity, as well as heterogeneity among segments (Sawtooth Software, 1999). 56 2001 Sawtooth Software Conference Proceedings: Sequim, WA. REFERENCES Allenby, Greg and J. L. Ginter (1995) “Using Extremes to Design Products and Segment Markets,” Journal of Marketing Research, 32, (November) 392-403. Bagozzi, Richard P. (1980) Causal Models in Marketing, New York: John Wiley & Sons. Campbell, D.T. and D.W. Fiske (1959) “Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix,” Psychological Bulletin 56: 81-105. Churchill, Gilbert A. (1979) A Paradigm for Better Measures of Marketing Constructs, Journal of Marketing Research, XVI (February), 64-73. _______ and J.P. Peter (1984) Research Design Effects on the Reliability of Ratings Scales: A Meta-Analysis, Journal of Marketing Research, XXI (November), 360-75. Cronbach, Lee (1951) “Coefficient Alpha and the Internal Structure of Tests,” Psychometrica, 16:3 (September), 297-334. Green, Paul (1997) “VOICE: A Customer Satisfaction Model With an Optimal Effort Allocation Feature,” Paper presented at the American Marketing Association’s Advanced Research Techniques Forum. Hanemann, W. Michael and B. 
Kanninen (1998) “The Statistical Analysis of Discrete Response CV Data,” University of California at Berkeley, Department of Agricultural and Resource Economics and Policy, Working Paper No. 798. Hosmer, David W. and S. Lemeshow (1989) Applied Logistic Regression. New York: Wiley Interscience. Kamakura, Wagner, V. Mittal, F. deRosa, and J. Mazzon (2001) “Producing Profitable Customer Satisfaction and Retention,” Paper presented at the American Marketing Association’s Advanced Research Techniques Forum. Menard, Scott (1995) Applied Logistic Regression Analysis. Sage Series in Quantitative Applications in the Social Sciences: 106. Thousand Oaks, CA: Sage. Mittal, Vikas (1997) “The Non-linear and Asymmetric Nature of the Satisfaction and Repurchase Behavior Link,” Paper presented at the American Marketing Association’s Advanced Research Techniques Forum. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 57 _________ and W. Kamakura (2000) “Satisfaction and Customer Retention: An Empirical Investigation,” Paper presented at the American Marketing Association’s Advanced Research Techniques Forum. Mulhern, Michael G. (1999) "Assessing the Impact of Satisfaction and Dissatisfaction on Repurchase Intentions," AMA Advanced Research Techniques Conference, Poster Session. Santa Fe NM. _______ and Lynd Bacon (1997) "Improving Measurement Quality in Marketing Research: The Role of Reliability," Working Paper. ________ and Douglas MacLachlan (1992) "Using Analysis of Residuals and Logarithmic Transformations to Improve Regression Modeling of Business Service Usage," Sawtooth Software Conference Proceedings. Sun Valley ID. Myers, James H. (1999) Measuring Customer Satisfaction: Hot Buttons and Other Measurement Issues. Chicago: American Marketing Association. Norusis, Marija/SPSS Inc. (1994) Advanced Statistics 6.1 Chicago: SPSS Inc. Sawtooth Software (1999) “HB-Reg for Hierarchical Bayes Regression” Technical paper accessible at www.sawtoothsoftware.com. Zaltman, G., Pinson, C.R.A., and R. Angelmar (1973) Metatheory and Consumer Research, New York: Holt, Rinehart, and Winston. 58 2001 Sawtooth Software Conference Proceedings: Sequim, WA. ACA/CVA IN JAPAN: AN EXPLORATION OF THE DATA IN A CULTURAL FRAMEWORK Brent Soo Hoo Research Analyst, Gartner/Griggs-Anderson Nakaba Matsushima Senior Researcher, Nikkei Research Kiyoshi Fukai Senior Researcher, Nikkei Research BACKGROUND Ray Poynter asserted in his 2000 Sawtooth Conference paper on Creating Test Data to Objectively Assess Conjoint and Choice Algorithms that “The author’s experience is that different cultures tend to have larger or smaller proportions of respondents who answer in this simplified way. For example, the author would assert that Japan has fewer extreme raters and Germany has more” (2000 Sawtooth Conference Proceedings, p. 150). Poynter called for more research data, and we believe we have some. HYPOTHESIS One of the big questions about using Sawtooth’s ACA or CVA conjoint programs in Japan is the probable cultural problem of selecting from the centroid (e.g., rating “4,” “5” or “6” from a standard ACA/CVA nine-point scale) on a pairwise comparison rating scale. This is what Ray Poynter calls the “simplified way.” Does this exist or not? It is argued that this is not really a problem at all, but an artifact of the society in which the data is gathered and that it leads to imprecise utility estimations that are still usable. 
Based on Japanese cultural homogeneity, our hypothesis is that Japanese people make different choices than Western people due to the cultural desire not to confront or be outspoken. Brand data is especially affected by this cultural desire. We have access to current data from Japan, which we analyzed for this potential problem. This is a phenomenon that has been widely acknowledged in Japan by various researchers and Sawtooth Software, but we can now examine some real data sets for this effect. Data We were fortunate to have access to eight recent ACA/CVA data sets conducted by Nikkei Research and/or Gartner/Griggs-Anderson in Japan (in Japanese characters) from 1998 to 2000. All of these studies had at least 85 respondents each, with a few of the studies having up to 300 total respondents. Four of these studies contained brand as an attribute. Nikkei Research contends that the role of brand in Japan is understated by exercises such as conjoint due to the fact that the conjoint does not properly address the external factors such as brand strength, marketing/name awareness and distribution. Why do foreign (Western) companies that have high-quality products have difficulty succeeding/selling in Japan? 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 59 Analysis Plan We compared the relationship between the utilities derived from the pairs to the utilities derived from the priors. There were differences in the results, although not with brand included as an attribute. The results were charted/graphed to show comparisons across the data available. The issue of branding was also analyzed using the data sets with the absence of brand as an attribute and the data sets with the presence of brand as an attribute. This contrast may help researchers design studies that are not culturally biased due to the introduction of branding in the conjoint exercises. We examined the available holdout data sets from our Japanese data (termed “catalog choices” by Nikkei Research) and compared hit rate accuracy for predicting the holdout or catalog choices with the ACA/CVA simulator. Then, we used the stronger analysis routines contained in Hierarchical-Bayes (HB) to improve the accuracy of predicting the holdout or catalog choices with HB-generated attribute utilities. Using this analysis could help to make future ACA simulators more accurate and less influenced by “external factors.” The result would be more accurate results from the data, which would please the clients and the researchers. In the cultural issues section, we examined the Japanese research with the researchers from Japan’s Nikkei Research. It is important to understand how Japanese culture and its homogeneous societal pressures affect branding when doing research in Japan. We will attempt to answer the question of why the data says what it does and how Japan is culturally different from the Western world. Many Western corporations are wondering how to penetrate the Japanese market. Ask Anheuser-Busch what happened to Budweiser beer in Japan; it did not fare well against the native Japanese competitive brands. We will look at some historic case studies from Japan of Western products’ inability to penetrate the market. What is strange is that in Japan there is much adoration of American cultural items/products, yet few of the big American brands/products are strong in Japan. Finally, in relation to the use of ACA/CVA in Japan, we have unique access to a particular study fielded in Japan in November 2000. 
Data collection staff filled out a research methods observational worksheet on each respondent. Quantitative and qualitative feedback from the respondents on the ACA exercise and administration were gathered. This paper contains rich feedback data that will help the reader understand the Japanese mindset and reaction to choice exercises of this sort. CENTROID ANALYSIS We looked at ratings from the nine-point bidirectional scale from ACA and CVA pairs comparisons. A full profile CVA study was included to compare against the ACA partial profile routines. Does full profile (CVA) make a difference when compared to partial profile (ACA)? 60 2001 Sawtooth Software Conference Proceedings: Sequim, WA. OUR HYPOTHESIS We believe that Japanese people select answers from the centroid (4, 5, 6) rather than from the outliers (1, 2, 3 or 7, 8, 9). Well-known Japan expert/researchers, George Fields, Hotaka Katahira and Jerry Wind, also addressed this subject in Leveraging Japan: “Japanese survey respondents tend to answer in a narrower range than their Western peers. Early in his career in Japan, George Fields was involved in a project in which more than 30 concepts were to be screened, using the same techniques used in the United States, the United Kingdom, Germany and Australia. In each case, the most critical measure was a “top of the box” rating on a certain scale. On this basis, all concepts failed in Japan. Worse yet, the data were non-discriminating—that is, the figures were similar for all concepts. The client was warned of this possibility. However, the experience of all the other countries could not be ignored, and the client had the prerogative to conduct the exercise. The data were discriminating to some extent when the two top boxes were combined, but this raised a quandary in that one could not be sure whether the discrimination simply indicated differences between mediocrities (some are not as bad as others) or superiority of some over others.”1 1 Fields, George and Katahira, Hotaka and Wind, Jerry, Leveraging Japan, San Francisco, Jossey-Bass Inc., 2000, p. 283. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 61 Fields et al. continue: “The ‘theory,’ simply stated, is that when confronted with a verbal seven-point scale, or a range from ‘excellent’ to ‘terrible,’ Westerners tend to start from the extremes. They consider a position quickly and then perhaps modify it, which means working from both ends and moving toward the center. The Japanese, on the other hand, tend to take a neutral position or start from the midpoint and move outward, seldom reaching the extremes, hence the low top-of-the-box ratings.”2 This is not a one-time occurrence. In another study, “A company that screened advertising concepts in both Japan and the United States found marked differences in responses to the survey. The verbal scale was far less sensitive in Japan. Far fewer said, ‘I like it very much,’ or ‘I dislike it very much.’ They clung to the middle. About 20 percent of the American respondents chose the highest box compared to just 7 percent of Japanese.”3 Fields et al. concludes, “Using even number scales, or those without a strict midpoint, doesn’t really resolve the issue. While this explanation is a little pat, it fits our usual observations of the Japanese being very cautious to take up a fixed position before all known facts and consequences are weighed. How, then, can one give a clear opinion on a product or an advertisement in a single-shot or short-term exposure? 
Here, time is the substance rather than the number of exposures.”4 Many researchers have observed it, isn’t it time that we try to quantify this phenomena? OUR CRITERIA Defining the “centroid” people: 50% or more of the pairs answers fall in the range of 4, 5, 6. Defining the “outlier” people: 50% or more of the pairs answers fall in the range of 1, 2, 3 or 7, 8, 9). OUR STUDIES/DETAILS Study 1—Nikkei Japanese Finance Study (July 2000): Tokyo and Osaka (n=307), six attributes with 30 total attribute levels, 37 pairs shown, ACA 3.0 converted to ACA 4.0 for analysis purposes, no holdouts/catalog, no brand attribute. Study 2—Gartner Networking Product Study (November 2000): n=123, 14 attributes with 40 total attribute levels (client designed the ACA design, not Gartner/Griggs-Anderson), 35 pairs shown, ACA 3.1 for data collection converted to ACA 4.0 for analysis purposes, has holdouts, no brand attribute; respondent observation worksheet and respondent comments collected by interviewers. Study 3—Nikkei Japanese Business Car Study (February 1998): n=107, five attributes with 15 total attribute levels, ACA 3.0, 12 pairs shown, no holdouts/catalog, has brand attribute. 2 Ibid. Fields et al., p. 283. Ibid. Fields et al., p. 283. 4 Ibid. Fields et al., p. 283-284. 3 62 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Study 4—Nikkei Japanese Passenger Car Study (February 1998): n=185, five attributes with 13 total attribute levels, ACA 3.0, eight pairs shown, no holdouts/catalog, has brand attribute. Study 5—Nikkei Japanese Software Study (March 1999): n=85, five attributes with 19 total attribute levels, ACA 3.0, 16 pairs shown, no brand attribute, has holdouts/catalog. Study 6—Nikkei/Gartner Japan/America Cross-Cultural Research Study (July 2001): Japanese segment (n=228), eight attributes with 33 total attribute levels, ACAWeb with CiW for demographics, 30 pairs shown, has holdouts/catalog, has brand attribute. Study 7—Nikkei/Gartner Japan/America Cross-Cultural Research Study (July 2001): American segment (n=85), eight attributes with 33 total attributes levels, ACAWeb with CiW for demographics, 30 pairs shown, has brand attribute, has holdouts/catalog. Study 8—Gartner Japanese Networking Product Study (spring and summer 1998): n=250, eight attributes with 25 total attribute levels, CVA 2.0 full profile design using Japanese Ci3, 30 pairs shown, has brand attribute, no holdouts/catalog. DATA PROCESSING It required extensive formatting and data processing to get Japanese ACA 3.x format data into usable ACA 4.0 and ACA/HB formats. This work was done using basic text editors and individually sifting through the ACD interview audit trail file for each study. ACA/HB does not work with ACA 3.x format ACD files. Additional statistical work was done in SPSS and SAS. Due to client confidentiality, some studies/attributes will be heavily masked. We have shown which attributes and levels go together when the exact attribute names and levels cannot be disclosed. Our hope is that this information will be useful even in its masked form. We have identified at least one attribute as much as possible in each data set, to give the reader an idea of how the brand or price attribute reacted. The results of the centroid/outlier analysis follow. Centroid vs. 
Outliers—Percentage of Total Respondents

Study      Centroid   Outlier
1          24.8%      75.2%
2          24.4%      75.6%
3          51.4%      48.6%
4          57.8%      42.2%
5          28.2%      71.8%
6 (PCJ)    38.6%      61.4%
7 (PCE)    22.4%      77.6%
8 (CVA)    36.8%      63.2%

Note: Different research topics, respondent types and different studies can yield different results. If the two Car studies (Studies 3 and 4) and the PCE study (Study 7) are excluded, the average is 30.56% centroid and 69.44% outlier.

DATA TO COMPARE: A STUDY WITH BOTH JAPANESE AND ENGLISH SEGMENTS

Japan/America Cross-Cultural Research Study. Data collection occurred in July 2001. Nikkei Research hosted and programmed the survey from its Japanese Web server (web.nikkei-r.co.jp/pcj or pce), and fielded dual studies running in Japanese and English at approximately the same time. The research topic was notebook computers. Brand was included in the attribute set. The software used was ACA/Web with CiW. The study included holdouts, which were the same for all respondents (Japanese/American). For the Japanese segment, the Nikkei Research Panel was used as the list source; 3,000 e-mail invitations were sent and no reminder e-mails were sent. For the American segment, Gartner Panel invitees were used as the list source; 1,921 e-mail invitations were sent and 1,660 reminder e-mails were sent. To qualify, respondents had to be planning to purchase a notebook computer in the next year. This question was asked within the first four questions to disqualify anyone not planning to purchase during this time period.

We spent a lot of time analyzing the data from these two studies, since they are the only pair that were done in both countries. Since this was not academic research, we did not have the luxury of an American or European segment for the other jobs. Doing research in Japan is expensive, and clients cannot always, or do not always, want to conduct studies in multiple countries.

Diagnostics

Japanese completes were 228, plus 758 who were disqualified but tried to participate: a 7.6% response rate for qualified respondents completing the study, and a 32.86% response rate overall. American completes were 85, plus 91 who were disqualified but tried to participate: a 4.4% response rate for qualified respondents completing the study, and a 9.16% response rate overall. Part-way terminations were n=35 for the Japanese segment and n=40 for the American segment. The Japanese are more likely to finish what they start, based on the ratio of the total number of completed interviews to part-way terminations. The ACA task is one that we have found generally takes Japanese people more time than Westerners. We hypothesize that this is because Japanese people are very detail-oriented and read the attributes in the pair-wise comparisons closely.

Respondent Co-Ops

There was a co-op prize drawing of 1,000 Yen (US$8) for 50 respondents in Japan (US$400 total) and a co-op prize drawing for a unique Japanese item worth $10 for 25 respondents in America (approximately US$250 total). Comments from American respondents were positive for the unique omiyage, or gift.
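For readers who want to apply the same cut to their own data, the short Python sketch below implements the 50%-in-the-middle rule behind the centroid/outlier tabulation above. The threshold and the 4-5-6 scale points follow the criteria stated earlier; the ratings matrix itself is made up for illustration.

    import numpy as np
    import pandas as pd

    def classify_raters(pairs_ratings: pd.DataFrame) -> pd.Series:
        """Label a respondent 'centroid' if 50% or more of their pairs answers fall in
        the middle of the 9-point scale (4, 5, 6), otherwise 'outlier'."""
        share_middle = pairs_ratings.isin([4, 5, 6]).mean(axis=1)
        return pd.Series(np.where(share_middle >= 0.5, "centroid", "outlier"),
                         index=pairs_ratings.index)

    # Hypothetical example: 5 respondents by 8 pairs questions on the 1-9 scale.
    ratings = pd.DataFrame([[5, 4, 6, 5, 4, 5, 6, 5],
                            [1, 9, 2, 8, 1, 9, 2, 7],
                            [4, 5, 3, 7, 5, 6, 2, 8],
                            [6, 6, 5, 5, 4, 4, 9, 1],
                            [9, 8, 7, 7, 8, 9, 9, 8]])
    labels = classify_raters(ratings)
    print(labels.value_counts(normalize=True))  # share of centroid vs. outlier raters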
Demographics

Purchase Intention    Japan (n=228)   Japan %    America (n=85)   America %
1-6 months            102             44.74%     42               49.41%
7-12 months           126             55.26%     43               50.59%

Gender                Japan (n=228)   Japan %    America (n=85)   America %
Male                  119             52.19%     68               80.00%
Female                109             47.81%     17               20.00%

Age Distribution

Age        Japan (n=228)   America (n=85)
0-19       2.63%           1.18%
20-24      9.65%           0.00%
25-29      16.67%          8.24%
30-34      25.00%          8.24%
35-39      23.68%          20.00%
40-44      13.60%          20.00%
45-49      4.82%           23.53%
50-54      2.63%           8.24%
55-59      0.88%           7.06%
60+        0.44%           2.35%
Refused    0.00%           1.18%

Holdout Specifications

Attribute       Holdout 1    Holdout 2   Holdout 3   Holdout 4   Holdout 5   Holdout 6
Brand           IBM          Sony        Toshiba     NEC         COMPAQ      Dell
CPU Brand       Pent. III    Celeron     Celeron     Athlon      Athlon      Pent. III
Proc. Speed     600 MHz      600 MHz     800 MHz     800 MHz     1.0 GHz     1.0 GHz
RAM             128 MB       64 MB       128 MB      128 MB      128 MB      64 MB
Display Size    12.1"        12.1"       14.1"       14.1"       14.1"       14.1"
Extra Drive     None         DVD         DVD         CD-RW       CD-RW       CD-ROM
Weight          4.4 lbs.     4.4 lbs.    6.6 lbs.    8.8 lbs.    6.6 lbs.    6.6 lbs.
Price           $1,500       $1,500      $2,000      $1,500      $2,000      $1,000

Holdout Results (averages), 0-to-100 scale—purchase likelihood

Holdout                Japan (n=228)   America (n=85)
Holdout 1 (IBM)        36.39           31.32
Holdout 2 (Sony)       47.00           38.11
Holdout 3 (Toshiba)    36.73           45.55
Holdout 4 (NEC)        38.74           44.50
Holdout 5 (Compaq)     30.91           51.11
Holdout 6 (Dell)       30.91           61.97

As our discussant Ray Poynter noted, we put the Japanese in a tough choice situation. By marrying the better features to the least preferred brand in the Japanese segment (Dell), the desirability of Holdout 6 was dampened. The exact opposite problem occurs with Holdout 2, the Sony product, which is the most preferred Japanese brand yet has poor specifications on the other attributes. The problem doesn't exist to that extent in the American holdouts, although there is not a really strong holdout overall, with Holdout 6 leading at only a 61.97 average out of a possible 100 points. Obviously, lack of differentiation in the holdouts can lead to lower hit rates on the predictions. We discussed this with the conference attendees to explore ideas on holdout formulation. It is not an easy choice, since a clearly superior holdout may predict better, but may not accurately represent the marketplace.

ACA Holdout Prediction

Study      Sample size   # of accurate predictions (ACA)   % of accurate predictions
2          n=123         84                                67.74%
5          n=85          24                                28.23%
6 (PCJ)    n=228         97                                42.54%
7 (PCE)    n=85          49                                57.64%

Note on Study 2: Client-generated holdouts; three "bad" options led most respondents to one "good" option. Note on Study 5: No brand in the ACA exercise; the study was on existing products, brand was shown in the holdouts, and strong brands affect purchasing. Note on Studies 6 and 7: Nikkei Research created holdouts based on actual products in catalogs; however, there are no clear winners, as shown in the holdout preferences. There is pressure between priors and pairs. This is a hard task for Japanese respondents, since the best brand was coupled with poor specs and the worst brand was coupled with better specs.
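The "% of accurate predictions" figures above and in the HB table that follows are holdout hit rates. One common way to compute a first-choice hit rate is sketched below, under the assumption that a "hit" means the holdout concept with the highest total utility for a respondent is also the concept that respondent rated highest; the data structures and names are illustrative, not the authors' actual code.

def concept_utility(utilities, concept):
    # utilities: {(attribute, level): part-worth} for one respondent
    # concept:   {attribute: level} describing one holdout product
    return sum(utilities[(attr, level)] for attr, level in concept.items())

def hit_rate(respondents, holdout_concepts):
    hits = 0
    for resp in respondents:
        predicted = max(range(len(holdout_concepts)),
                        key=lambda i: concept_utility(resp["utilities"], holdout_concepts[i]))
        observed = max(range(len(holdout_concepts)),
                       key=lambda i: resp["holdout_ratings"][i])  # 0-100 purchase likelihood
        hits += (predicted == observed)
    return hits / len(respondents)

# Tiny illustrative usage with made-up part-worths and two hypothetical holdouts:
demo_respondent = {"utilities": {("Brand", "Sony"): 40, ("Brand", "Dell"): 10,
                                 ("Price", "$1,000"): 55, ("Price", "$2,000"): 30},
                   "holdout_ratings": [47, 31]}
demo_holdouts = [{"Brand": "Sony", "Price": "$2,000"}, {"Brand": "Dell", "Price": "$1,000"}]
print(hit_rate([demo_respondent], demo_holdouts))  # -> 1.0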
HB Holdout Prediction

Study      Sample size   # of accurate predictions (HB)   % of accurate predictions
2          n=123         81                               65.85%
5          n=85          26                               30.58%
6 (PCJ)    n=228         96                               42.10%
7 (PCE)    n=85          49                               57.64%

Note on Study 2: Client-generated holdouts; three "bad" options led most respondents to one "good" option. HB didn't help. Note on Study 5: No brand in the ACA exercise; the study was on existing products, brand was shown in the holdouts, and strong brands affect purchasing. There was slight improvement with HB. Note on Studies 6 and 7: Nikkei Research created holdouts based on actual products in catalogs; however, there are no clear winners, as shown in the holdout preferences. There is pressure between priors and pairs. This is a hard task for Japanese respondents, since the best brand was coupled with poor specs and the worst brand was coupled with better specs. HB didn't help.

PRIORS VS. PAIRS UTILITIES

Question: Is there a difference between the self-explicated priors utilities and the pairs utilities? Hypothesis: There will be more Japanese in the centroid because they are culturally less likely to stand out/be outspoken. ACA/HB was used to calculate the pairs utilities, since ACA 3.x does not compute these. Nikkei Research wrote a program to calculate the priors utilities from the raw ACD logfiles, based on Sawtooth Software's methods from ACA 4.0. ACA PTS files were generated, showing just the mean utilities for comparison. Significant differences are flagged at the 95% confidence level. Shown below are the Priors/Pairs utilities for Studies 6 and 7. All other Priors/Pairs utilities are in Appendix A.

Priors vs. Pairs Utilities

Study 6 (PCJ), n=228       Priors    Pairs
PC Brand—Sony              36.19*    39.50*
PC Brand—Dell              12.80     11.42
PC Brand—IBM               25.05*    28.16*
PC Brand—NEC               23.77*    29.94*
PC Brand—Toshiba           21.27*    26.59
PC Brand—Compaq            11.66     12.40
CPU Pentium III            37.26*    33.38*
CPU Crusoe                 13.75*    9.13*
CPU Celeron                10.67     8.88
CPU Athlon                 7.06      7.35
CPU Speed 600 MHz          0.00*     2.39*
CPU Speed 800 MHz          23.70*    16.69*
CPU Speed 1.0 GHz          47.41*    25.35*
64 Mb RAM                  0.00*     0.71*
128 Mb RAM                 26.93*    24.02*
256 Mb RAM                 53.86*    39.29*
12.1" display              7.55      6.58
13.3" display              15.00     15.47
14.1" display              27.21*    21.26*
15.0" display              33.19*    25.88*
No extra drive             2.50*     0.57*
CD-ROM drive               30.24*    33.71*
DVD-ROM drive              36.86*    41.34*
CD-RW drive                40.59*    56.33*
CD-RW/DVD-ROM              44.46*    67.22*
4.4 lbs                    47.14*    35.33*
6.6 lbs                    20.97     22.14
8.8 lbs                    0.45*     1.53*
$1,000                     56.18     55.99
$1,500                     42.14*    45.64*
$2,000                     28.09*    33.80*
$2,500                     14.05*    19.83*
$3,000                     0.00*     2.19*
* Significant difference at 95% confidence level.

Study 7 (PCE), n=85        Priors    Pairs
PC Brand—Sony              25.05     24.28
PC Brand—Dell              26.72*    19.84*
PC Brand—IBM               23.72     22.81
PC Brand—NEC               7.08      7.83
PC Brand—Toshiba           23.03     23.83
PC Brand—Compaq            19.31     18.84
CPU Pentium III            40.42     41.50
CPU Crusoe                 9.03*     12.99*
CPU Celeron                8.55      6.74
CPU Athlon                 19.22*    26.08*
CPU Speed 600 MHz          0.00*     1.15*
CPU Speed 800 MHz          24.33*    14.84*
CPU Speed 1.0 GHz          48.66*    21.33*
64 Mb RAM                  0.00      0.17
128 Mb RAM                 27.53     27.09
256 Mb RAM                 55.06*    42.81*
12.1" display              0.27      0.71
13.3" display              13.22*    16.97*
14.1" display              33.68*    30.45*
15.0" display              43.92     40.25
No extra drive             0.55      0.15
CD-ROM drive               38.92     35.60
DVD-ROM drive              36.01*    41.91*
CD-RW drive                37.10*    52.27*
CD-RW/DVD-ROM              44.86*    71.76*
4.4 lbs                    42.25*    30.95*
6.6 lbs                    22.09     19.47
8.8 lbs                    0.50      0.52
$1,000                     51.56     55.27
$1,500                     38.67     41.32
$2,000                     25.78*    31.15*
$2,500                     12.89*    17.50*
$3,000                     0.00*     1.61*
* Significant difference at 95% confidence level.

DISCUSSANT ISSUES

Ray Poynter gave us some valuable feedback on the shape of the data in comparison to the absolute values of the data. In his ad hoc analysis of the aggregate data, there was a lot of shape consistency. We did a frequency distribution of the answer range (one through nine) for Studies 6 and 7. There were approximately two times as many pairs answers in the "5" midpoint position for the Japanese.

Frequency Distribution of Study 6 (PCJ)—30 pairs x 228 respondents
1 - 7.7%   2 - 5.2%   3 - 15.5%   4 - 10.0%   5 - 18.4%   6 - 12.7%   7 - 17.1%   8 - 5.1%   9 - 8.3%

Frequency Distribution of Study 7 (PCE)—30 pairs x 85 respondents
1 - 6.8%   2 - 7.5%   3 - 16.2%   4 - 10.4%   5 - 9.3%   6 - 11.8%   7 - 20.7%   8 - 7.5%   9 - 9.6%

SHAPE VS. ABSOLUTE VALUES

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
PC Brand—Sony           36.19     39.50      25.05     24.28
PC Brand—Dell           12.80     11.42      26.72     19.84
PC Brand—IBM            25.05     28.16      23.72     22.81
PC Brand—NEC            23.77     29.94      7.08      7.83
PC Brand—Toshiba        21.27     26.59      23.03     23.83
PC Brand—Compaq         11.66     12.40      19.31     18.84
Imp                     24.53     28.08      19.64     16.45
rSquare                 0.95                 0.84
distance                9.46                 7.08
Note: Brand is more important in Japan; top American brands are not worth much in Japan.

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
CPU Pentium III         37.26     33.38      40.42     41.50
CPU Crusoe              13.75     9.13       9.03      12.99
CPU Celeron             10.67     8.88       8.55      6.74
CPU Athlon              7.06      7.35       19.22     26.08
Imp                     30.20     26.03      31.87     34.76
rSquare                 0.98                 0.94
distance                6.30                 8.20
Note: The CPU chip is less of an issue in Japan—only Pentium will do. America shows a plus for Pentium and Athlon.

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
CPU Speed 600 MHz       0.00      2.39       0.00      1.15
CPU Speed 800 MHz       23.70     16.69      24.33     14.84
CPU Speed 1.0 GHz       47.41     25.35      48.66     21.33
Imp                     47.41     22.96      48.66     20.18
rSquare                 0.98                 0.96
distance                23.27                28.95
Note: CPU speed shows the weakness of priors. The attribute looks important until it is traded off. Results are nearly identical in Japan and America.

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
64 Mb RAM               0.00      0.71       0.00      0.17
128 Mb RAM              26.93     24.02      27.53     27.09
256 Mb RAM              53.86     39.29      55.06     42.81
Imp                     53.86     38.58      55.06     42.64
rSquare                 0.99                 0.98
distance                14.87                12.26
Note: A similar weakness of priors, but not as pronounced. The attribute looks important until it is traded off. Results are nearly identical in Japan and America.

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
12.1" display           7.55      6.58       0.27      0.71
13.3" display           15.00     15.47      13.22     16.97
14.1" display           27.21     21.26      33.68     30.45
15.0" display           33.19     25.88      43.92     40.25
Imp                     25.64     19.30      43.65     39.54
rSquare                 0.96                 0.98
distance                9.49                 6.18
Note: Screen size is less important in Japan—footprint/size is a hidden attribute. Its importance in Japan pulls opposite of large screen sizes.
                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
No extra drive          2.50      0.57       0.55      0.15
CD-ROM drive            30.24     33.71      38.92     35.60
DVD-ROM drive           36.86     41.34      36.01     41.91
CD-RW drive             40.59     56.33      37.10     52.27
Combo drive             44.46     67.22      44.86     71.76
Imp                     41.96     66.65      44.31     71.61
rSquare                 0.94                 0.84
distance                28.31                31.62
Note: The opposite of the priors problem occurs here. The priors cannot fully express the importance the attribute carries in the pairs. The machine must be able to read CDs—a "no brainer."

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
4.4 lbs                 47.14     35.33      42.25     30.95
6.6 lbs                 20.97     22.14      22.09     19.47
8.8 lbs                 0.45      1.53       0.50      0.52
Imp                     46.69     33.80      41.75     30.43
rSquare                 0.96                 0.99
distance                11.92                11.60
Note: Weight is less relevant when assessed in the pairs comparisons.

                        Study 6 (PCJ)        Study 7 (PCE)
                        Priors    Pairs      Priors    Pairs
$1,000                  56.18     55.99      51.56     55.27
$1,500                  42.14     45.64      38.67     41.32
$2,000                  28.09     33.80      25.78     31.15
$2,500                  14.05     19.83      12.89     17.50
$3,000                  0.00      2.19       0.00      1.61
Imp                     56.18     53.80      51.56     53.66
rSquare                 0.99                 1.00
distance                9.12                 8.57
Note: Price is important, but it trails the drive attribute. Similar results occur in Japan and the U.S.

BRAND ISSUES

We looked at "brand" and "no brand" in the pairs and the centroid/outlier classifications to determine whether the presence or absence of a brand attribute in a pair drove answers to the centroid or the outliers. This analysis was only possible when brand was included in the attributes (four studies). Surprisingly, brand did not seem to drive answers to the centroid or outlier. Perhaps the Japanese are not as brand-conscious as previously thought; the literature notes that this brand awareness and loyalty is fading.

Study      No brand   % centroid   Brand    % centroid
3          n=55       51.40%       n=57     53.27%
4          n=120      64.86%       n=95     51.35%
6 (PCJ)    n=87       38.15%       n=86     37.71%
7 (PCE)    n=20       23.52%       n=17     20.0%

Qualitative Feedback—From Study 2's Respondent Observation Worksheet

• 36% of respondents asked the interviewer(s) five or more questions during the ACA process.
• 48% of respondents looked at the product definition/glossary sheet only once. (It is a good idea to include a product definition/glossary sheet that goes over all the attributes and levels used in the conjoint study, especially when researching technical products.) In U.S. studies, Gartner's experience has generally seen more usage of the product definition/glossary sheet.
• 36% of respondents took approximately 10 seconds to make a choice in pairs at the start of the pairs section.
• 33% of respondents took approximately five seconds to make a choice in pairs at the end of the pairs section.
• 18% of Japanese respondents did not seem to be comfortable with the ACA tasks.
• 36% of Japanese respondents said unprompted that ACA was a hard task.

Many comments came back from Japanese research participants on ACA Study 2. The top five comments follow: • "Seems like a psychological exam/analysis." • "Too long. Too many questions/pairs." • "Difficult choices. Exhausting task." • "I don't know if I'm being consistent with answers from start to finish.
Inconsistency.” • “I’m curious about total results and analysis to come—how the data/research design is analyzed.” CULTURAL ISSUES Japan is the second largest consumer market in the world.5 Japan was ranked second in Gross Domestic Product (GDP) at 15.8% of the total world economy comparing to the U.S.’s 25.5%. This ranking is even more impressive considering the smaller size of the population of Japan in comparison to the U.S., and the rest of the world for that matter. Even with the crash of the Asian markets in the past few years, Japan is an economic power and a market worth pursuing. If anything, the recent economic downturns have increased opportunities for Western companies in Japan. Japan’s number two automobile manufacturer, Nissan, had to sell 36.8% of the company to French automobile manufacturer Renault in order to keep afloat.6 Nissan has returned to profitability since that happened. Historically ”sky high” real estate prices have evened out, approaching levels for similar space in the U.S.7 As the real estate and currency have equalized, more and more Western companies have gone to Japan in search of profits from this lucrative market. Japan is culturally a hard place to break into due to cultural heterogeneity/insularity. Being an island nation with a unique language and a strong work ethic has kept foreign labor and immigrants to a minimum. The Japanese people did not need foreign workers to work the farms as other countries did. The history of Japan has made things happen a certain way. In the famous management book, Theory Z, William Ouchi states that “this characteristic style of living paints the picture of a nation of people who are homogenous with respect to race, history, language, religion and culture. For centuries and generations these people have lived in the same village next door to the same neighbors. Living in close proximity and in dwellings which gave little privacy, the Japanese survived through their capacity to work together in harmony. In this situation, it was inevitable that the one most central social value which emerged, the one value without which the society could not continue, was that an individual does not matter”8 Ouchi discusses the basis for teamwork which came from the need to cooperate to farm. He makes comparisons of cooperative clusters of Japanese households around Japanese farms and the space separation of homesteads in the U.S.9 The very basis of how the Japanese have been raised in an environment of teamwork and cooperation stands against the American ideals of self-sufficiency and independence. Japanese society is built on this environment of teamwork and working things out. One thing that is not as pronounced in Japan is the large amount of lawyers and lawsuits that are part of the everyday culture in the U.S. “In 1998 Japan had one lawyer for every 6600 people, the lowest per capita figure amongst major industrialized nations. In the U.S. at the time 5 Ibid. Fields et al., p. 9-11. Kageyama, Yuri, “Bargain Shopping in Japan,” The Oregonian, 8/15/2001, Section B, p.1-2. 7 Ibid. Kageyama, p. 1-2, and Fields, p. 3. 8 Ouchi, William G., Theory Z, New York, Avon Addison-Wesley Publishing Company, 1981, p.54-55. 9 Ibid. Ouchi, p.54-55. 6 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 73 there was one lawyer for every 300 people and one for every 650 people in the ”U.K.”10 The implicit cultural message is to avoid confrontation. For the Japanese, this goes hand in hand with the desire for long-term relationships. 
These types of lawyer statistics “clearly reflect the American concern for legal rights, which sharply define many relationships in the U.S., both professional and personal. The Japanese, on the other hand, are less concerned about legal rights and duties and what’s legally mine or yours than about the quality of a relationship in terms of longevity and mutual supportiveness.”11 Societal politeness in Japan is something that Westerners are not very familiar with. Our discussant Ray Poynter commented that of the Western nations, perhaps the U.K. is the most familiar with the levels of societal politeness as a result of an awareness of class structure and royalty in the U.K. From the measured way that one introduces oneself in a business setting to the tone/vocabulary of the Japanese language in precise situations, Japan is unlike Western countries. Anyone who has done business in Japan can tell you of the formal presentation of meishi or business cards.12 Entire books and chapters in books about Japan are devoted to the differences in communication styles.13 In the West, we work in polarized black and white distinctions. In Japan, it is more shades of gray. It would be hard to get agreements immediately in Japanese business. The answers given seem to be noncommittal to Westerners. Part of this is attributable to the need for a group decision. The people you may be meeting with need to take the proposal back to their team for a group decision and consensus. The lack of an immediate answer could also be met with just a smile or silence from a Japanese person. Instead of giving you a strict “no” and causing you a loss of face, they would rather be ambiguous. The concept of “face” is also very foreign to Westerners. The basis of Japanese society on social surface harmony calls for not embarrassing anyone. “In tandem with the Japanese desire for saving face and maintaining surface harmony is their aversion to the word ‘no.’”14 The Japanese person will probably say something like “cho muzukashii” or “it’s a little bit difficult” when pressed for an answer that may be uncomfortable to either the questioner or respondent. Although this is changing with the most recent generations of Japanese people, there is a societal unwillingness to stand out of crowd. Until the past 10 or so years, it was unheard of to dye hair in Japan, now hair dye is a market estimated at 53 billion Yen (approximately $530 million).15 However, for the general working class Japanese, they strive to just fit in. Workers are often supplied with company uniforms and almost anyone who has visited Japan on business is familiar with the dark suit of a salary man. Work units are set up in teams with management working elbow to elbow with lower echelon employees. In a Japanese office, there is a lack of Western cubicles and private space. People sit across from each other on communal undivided tables. This adds to the teamwork in the workplace that Japanese people are famous for. In the U.S., the cubicles are individually oriented. A Web commentary from the Asia Pacific 10 Melville, Ian, Marketing in Japan, Butterworth-Heinemann, 1999, p.60-61. Deutsch, Mitchell F., Doing Business with the Japanese, New York, Mentor NAL, 1983, p.68. 12 Rowland, Diana, Japanese Business Etiquette, New York, Warner Books, 1985, p.11-17. See also commentary on translation tone/context in Melville p. 121. See also commentary on translation/market research in Christopher, Robert C., American Companies in Japan, New York, Fawcett Columbine, 1986, p.128. 
13 Shelley, Rex, Culture Shock! Japan, Kuperard, 1993, p. 116-135.
14 Ibid. Rowland, p.31, Melville, p.107, and Deutsch, p. 80-83.
15 Ibid. Fields et al., p. 12.

Management Forum describes the Japanese concept of "true heart," or magokoro, which captures the Japanese psyche: "a person who exhibits all the characteristics that have traditionally been attributed to the ideal Japanese—one who meticulously follows the dictates of etiquette, is scrupulously truthful and honest, can be trusted to fulfill all his obligations, and will make any sacrifice necessary to protect the interests of friends or business partners."16 That is not a common Western ideal, where interest in individual needs takes precedence. The word for 'individualism' in Japanese, kojinshugi, is noted by Japan-based American journalist Robert Whiting as carrying a negative connotation: "The U.S. is a land where the hard individualist is honored…In Japan, however, kojinshugi is almost a dirty word."17

Japan has on average 2.21 more levels in its product distribution structure than Western countries do.18 This leads to lower profit margins at each step and potentially higher prices. The basis of these relationships goes back to the keiretsu, or aligned business group. The large companies of Japan have been around for hundreds of years; they have built business relationships that cross into many product areas and distribution nets. The alignment of these companies as suppliers, consumers and distributors of component goods and products ties all large-scale Japanese businesses together. From Dodwell's Industrial Groupings in Japan chart, the multitude of various complex relationships can be seen.19

16 De Mente, Boyd Lafayette, "Monthly Column," Asia Pacific Management Forum, September 2000, p. 1-3; www.apmforum.com/columns/boye42.htm. See also Deutsch, p. 139.
17 Whiting, Robert, You Gotta Have Wa, New York, Vintage Books, 1989, p. 66. According to De Mente, "wa" incorporates mutual trust between management and labor, harmonious relations among employees on all levels, unstinting loyalty to the company (or team), mutual responsibility, job security, freedom from competitive pressure from other employees, and collective responsibility for both decisions and results: www.apmforum.com/columns/boye11.htm, p.2.
18 Anderson UCLA Grad School Web resource, www.anderson.ucla.edu/research/japan/t4/sup2art.htm.
19 Czinkota, Michael R. and Woronoff, Jon, Unlocking Japan's Markets, Rutland (VT), Charles E. Tuttle Company, 1991, p. 34.

[Chart: Industrial groupings in Japan. The legend distinguishes the six major industrial groups, two medium industrial groups led by leading banks, and vertically integrated groups, with lines indicating relationships. Companies and groups named include Morimura, Mitsubishi, Tokyu, Nippon Steel, Tokai, Toshiba, Toyota, Mitsui, Oji, IBJ, Matsushita, Sumitomo, Fuyo, Nissan, Kawasaki, Seibu Saison, DKB, Meiji, Hitachi, Furukawa and Sanwa. Credit: Dodwell Marketing Consultants, Industrial Groupings in Japan, 1988, p. 5.]

Long-standing relationships are what Japanese society is built upon, from how businesses recruit and keep employees to how business suppliers/manufacturers are selected. "An overriding concern for many Japanese businesses is loyalty and, related thereto, stability.
It is therefore regarded as advantageous to create a more structured framework for business activities. This spawns the long-standing and time-honored relationships many Japanese take pride in.”20 Over and over the literature on Japan will quote successful Western businessmen as describing that it will take “”five to 10 years to turn a profit.”21 These relationships go all the way down to the personal level, a sort of business contact net. The Japanese term for this is jinmyaku. In Japanese business, it is often who you know that drives success, not how good your product is. This is changing somewhat now, but these relationships are contrary to similar business relationships in Western countries. 20 21 Ibid. Czinkota and Woronoff, p. 35-36. See also ”Long-Term Perspective,” Czinkota and Woronoff, p. 179-180. Ibid. Deutsch, p. 155. 76 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Cultural differences in product formulation are evident. We missed a crucial attribute for our PCJ study in Japan by omitting “”footprint/size. Because of the limited space in Japanese offices and homes, the size of an appliance or piece of equipment is very important. This can be seen all the way down to the miniature size of personal electronics in Japan. Wireless communication giant DoCoMo has been able to drive the development of tiny wireless Internet phones, working with major manufacturers but branding them with the DoCoMo brand name.22 The packaging of a product in Japan is so important because of the gift-giving culture on both a personal and corporate level. Since many items are given as gifts, the importance of an attractive package with no flaws is perhaps more important than similar goods in the West.23 Czinkota and Woronoff cover the following Japanese product needs: the quality imperative, the importance of price, product holism, reliability and service, and on-time delivery. Another issue that needs to be addressed is that of tailoring the product correctly for Japan. Examples of how equipment needs to be adjusted for Japanese body sizes abound. A company must not use the “one size fits all” mentality with Japan. The sophistication of the consumer to see design or manufacturing flaws is well-documented. This speaks to the issue of product holism. In Japan the product has got to work on all levels—from the way it looks to pleasing colors to being well-designed and to being well-supported by customer service/warranties/repair/parts. Not always does the best product in terms of features win. The durability and reliability of a product is more of a concern in Japan. Customer service has got to match the high standards and expectations. In a department store in Japan, store employees will inundate the customer with offers of assistance. In a department store in the West, the customer may have to actively search for a store employee. The Japanese expectation of service and support is very high. “Japanese customers and channel members expect major service backup, no matter what the cost, with virtually no time delay. Service or sa-bi-su is a highly regarded component of a product and is expected throughout the lifetime of a business relationship.”24 Defective goods are a source of shame; no Japanese company would want to bring such dishonor on the company. Market research in Japan is a topic that hopefully this paper will shed some light upon. Japan spends 9% of the world’s total spent on market research. This number compares to the U.K.’s 9% and the U.S.’s 36%. 
The biggest spender in worldwide market research dollars is the European Union (including Germany and the U.K.) with approximately 46%.25 Melville gives some information on the comparative market research costs by country: “Japan’s 9% is rather low, considering that research in Japan is more expensive than in any other country. On average, it is about twice the cost of research in Western Europe, and 18% higher than in the U.S.”26 When doing work in Japan, translation is an under-appreciated item. The American Electronics Association Japan (AEA Japan) reinforces a stance on tone/formality/vocabulary in translation found consistently in the literature in the following passage. “If your technology is leading-edge, 22 Rose, Frank, “Pocket Monster,” Wired, September 2001, p. 126-135. Ibid. Melville, p. 134. See also De Mente on Japanese design aesthetics, the words shibui = restrained/refined, wabi=simple/quiet/tranquil, sabi=the beauty of age, yugen=mystery/subtlety. p. 3; www.apmforum.com/columns/boye14.htm. 24 Ibid. Czinkota and Woronoff, p. 178. 25 Ibid. Melville, p. 161. 26 Ibid., Melville, p. 161. 23 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 77 you do not want to go with a discount translation company; they will produce only egregiously embarrassing, nonsensical garbage that will tarnish your company image.”27 Case studies like Budweiser beer’s experience in Japan are fodder for many MBA marketing programs. Judging success is a relative measure. Would Anheuser-Busch rather be the number one imported beer in Japan, or would it rather have a bigger share of the domestic Japanese market which is approximately one billion cases annually (from 1993 numbers)? The Budweiser case study goes back to the licensing agreement it had with Suntory. The problem was that Suntory did not have as strong of a distribution network compared to other domestic Japanese beers. The market share of Budweiser in 1993 was 1.2% of the total Japanese beer market, which equates to approximately 10.1 million cases. In comparison, Budweiser carried a hefty 21.6% market share in the U.S. beer market, which equates to a hefty 330 million cases. In 1993, Anheuser-Busch dismissed Suntory and launched a joint venture with larger brewer Kirin. This was an effort to gain more control of the distribution system on Anheuser-Busch’s part and to also use a stronger Kirin distribution system. The control of advertising was also a point of contention because in the old Suntory agreement all advertising was controlled by Suntory. Suntory did little in the way of advertising and promotion of Budweiser. With the new Kirin joint venture, Anheuser-Busch controlled the important media submissions and had more freedom to experiment in this market.28 Only time well tell how well Budweiser’s efforts in Japan will result. Many foreign companies have failed in Japan for various reasons. The main reasons have been due to: failure to understand the market, impatience, product marketing, long-term business relationships, distribution systems and language/cultural barriers. Proctor & Gamble (P&G) almost failed in Japan, but had the guts and long-term commitment to stay in the market. Bob McDonald, the president of P&G Japan, conveyed the following message to the UC Berkeley Haas business school Japan tour in 1999. “The execution of P&G Japan’s Japanese market entry was a rocky road. At one point, they seriously had to consider writing off their investment and retiring from the Japanese market. 
They chose to stay and compete with the local competitors, even though it meant having to develop a precise understanding of Japanese consumer needs, their way of doing business, and especially their distribution infrastructure.”29 P&G could leverage its brand name much like the local keiretsu could. Koichi Sonoda, Senior Manager of Corporate Communications at Dentsu (the number one Japanese ad agency), said, “Without reputation or trust of its corporate brand, a company like Proctor and Gamble would have experienced hardship not only in sales performance of their products, but also in dealing with wholesalers, retailers, or even financial institutions. Building trust or a favorable corporate image is the most important factor for success in Japan, whereas in the U.S., when the product brand is well accepted, the corporate brand matters less.”30 P&G did the necessary research with its consumers to identify opportunities. The Joy brand dish detergent that was introduced in 1995 as a concentrated grease fighting soap is a great example of adapting 27 Business in Japan Magazine, “Q&A with the AEA,” p.3; www.japan-magazine.com/1998/sep/zashi/dm5.htm. See also commentary on translation tone/context in Melville p. 121. 28 Ono, Yumiko, “King of Beers wants to rule more of Japan,” The Wall Street Journal, October 28, 1993, p. B1-B6. See also www.eus.wsu.edu/ddp/courses/guides/mktg467x/lesson5.html. 29 McDonald, Bob, “Consumer Market Entry in Japan,” Haas Summer Course; www.haas.berkeley.edu/courses/summer1999/e296-3/trip/japan/pandg.htm. Another fact lost on some companies going into Japan, that there are domestic Japanese products that must be considered competition. 30 Moffa, Michael, “Ad-apting to Japan: a Guide for Foreign Advertisers,” Business Insight Japan Magazine, Nov., 1999, p. 2; www.japan-magazine.com/1999/november/ad-apting.htm. 78 2001 Sawtooth Software Conference Proceedings: Sequim, WA. a product to Japan/Asia. P&G also worked on the packaging with redesigned bottles that saved shelf space, which would no doubt please the retailer and distributors as well as the consumers.31 There are computer usage issues in Japan. The Japanese need more help using computers due to the fact that using computers is a fairly recently learned skill. Computers are up to 25% integration into the home as of 1998.32 Japanese people older than 45 years are not likely to use computers and the aging population will not be computer literate like the younger users are. There has been an operating system bias since it takes time to localize systems and programs into Japanese. Typing on a keyboard is not common; it is harder since the characters are multistroke. Handwriting is more common. There are still double-byte operating system issues. Japanese researchers are still using ACA 3.0 versions which only work on the proprietary NEC operating system—“pain in the NEC.” Lack of double-byte system support for different software packages is a continuing problem. Although the Internet is to some degree replacing the need for some of this localization, according to recent Japanese Internet usage demographics, the growing 19.4 million “wired” citizens entail approximately 25% of the entire population.33 Surprisingly, Japanese schools are not well-equipped or trained to teach computer usage and the Internet. “In Japan, fewer than 70% of the schools actively use the three or so PCs that are connected to the Internet. Of these three, one is used by teachers, and the remaining two by students. 
If you consider the fact that an average elementary school holds up to 1000 students, the schools are not doing enough in computer education. Some people even joke that teachers lock up the PCs so they won’t have to teach the Internet, because kids may learn faster than the teachers.”34 Statistics reveal that only 20% of Japanese teachers know how to use PCs. The experiences of Japanese Exchange and Teaching (JET) assistant English teachers also reveal some limitations in the way that school is taught. The strictness, old school methods and regimentation of the Japanese education are well-documented. INTERPRETATIONS/CONCLUSIONS These were real research projects conducted in Japan. They are not academically configured with corresponding research projects in America or Western countries. Based on the Japan/America Cross-Cultural Research Study, there is evidence that Japanese people tend to select answers from the centroid rather than from the outliers. To similar and lesser degrees this is backed up by the other Japanese-only studies conducted. The tendency to select from the centroid leads to significant differences (in absolute values) between the priors utilities and the pair utilities. However, the shape of Japanese data based on Euclidean distances is more consistent. 31 Ibid. Fields et al., p. 18-19. Ibid. Melville, p. 175. 33 Japan Inc Magazine, “Selling to Japan Online,” p. 2; www.japaninc.net/mag/comp/2000/10/print/oct2000p_japan.html. 34 Sekino, Nicky, “Compaq’s Murai Challenges Info Age Japan,” Business Insight Japan Magazine, 1998, p. 3; www.japan-magazine.com/1998/nov/bijapan-www/zashi/dm4.htm 32 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 79 In the U.S., respondents know what they like and don’t like in a matter of absolutes, but not as much on the points in between—a basic confrontational American way of thinking. In Japan, respondents put more careful thought into the choices and don’t change their minds. The pairs match the priors in terms of shape. This does not mean that ACA is not effective. Additional cluster analysis showed no significant differences in final utilities between centroid people and outlier people. ACA combines final utilities using data/weighting from the priors and the pairs. For studies conducted in Japan alone, utilities will hold up fine. In comparing ACA data from Japan to ACA data from Western countries, the researcher should be aware of the centroid issue and process the data with this in mind. Weighting methods and using holdouts will provide ways to tune the data to work across all countries. Surprisingly, for brand/no brand in the pairs comparisons, we did not find any consistent difference in the assignment of centroid or outlier based on the answers to the paired comparisons. Once again, these studies did not all include brand as an attribute, so we only have the data we have shown. On holdouts prediction, we feel that brand does play a part in the holdout selections/predictions we have seen. However, Nikkei Research cautions that so much more goes into the product choice in Japan (packaging, color, size, distribution method, etc.). This study was a very good learning and teamwork experience for us as researchers as representatives from both the U.S. and Japan. The Japanese reluctance to address cultural issues and make bold statements on many occasions is contrasted by the U.S. proclivity to do just that. 
This is a basic difference between our cultures and important to keep in mind when undertaking any research project in Japan. 80 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Appendix A Priors vs. Pairs Utilities: We’ve tried to show, where possible, the brand and pricing attributes. n=307 Study 1: Priors Pairs Attribute 1 Level 1 14.54* 2.40* Attribute 4 Level 3 Attribute 1 Level 2 19.35* 16.07* Attribute 4 Level 4 Attribute 1 Level 3 28.71* 23.64* Attribute 4 Level 5 Attribute 1 Level 4 21.24 22.73 Attribute 4 Level 6 Attribute 1 Level 5 17.17* 24.99* Attribute 5 Level 1 Attribute 2 Level 1 11.29* 4.71* Attribute 5 Level 2 Attribute 2 Level 2 19.13* 14.54* Attribute 5 Level 3 Attribute 2 Level 3 26.89* 17.44* Attribute 5 Level 4 Attribute 2 Level 4 18.78* 14.22* Attribute 6 “Cost” Level 1 Attribute 2 Level 5 16.35* 10.40* Attribute 6 “Cost” Level 2 Attribute 3 Level 1 16.58* 11.52* Attribute 6 “Cost” Level 3 Attribute 3 Level 2 14.35* 4.35* Attribute 6 “Cost” Level 4 Attribute 3 Level 3 20.33* 11.78* Attribute 6 “Cost” Level 5 Attribute 3 Level 4 19.28* 11.13* Attribute 3 Level 5 21.30* 16.02* Attribute 4 Level 1 23.48* 50.08* Attribute 4 Level 2 29.43* 50.37* * Significant difference at 95% confidence level. n=123 Study 2: Attribute 1 “Cost” Level 1 Attribute 1 “Cost” Level 2 Attribute 1 “Cost” Level 3 Attribute 1 “Cost” Level 4 Attribute 2 Level 1 Attribute 2 Level 2 Attribute 3 Level 1 Attribute 3 Level 2 Attribute 4 Level 1 Attribute 4 Level 2 Attribute 5 Level 1 Attribute 5 Level 2 Attribute 5 Level 3 Attribute 5 Level 4 Attribute 6 Level 1 Attribute 6 Level 2 Attribute 6 Level 3 Attribute 6 Level 4 Priors Pairs 38.25 44.54 33.01* 47.50* 27.59* 36.79* 14.63* 8.93* 18.87* 4.76* 42.87* 26.00* 11.48* 2.22* 55.52* 35.29* 3.20* 0.34* 66.88* 54.37* 11.64 8.67 27.25 28.19 52.94* 79.44* 67.53* 93.55* 8.16 6.17 38.05 38.36 46.92* 32.92* 50.77 45.15 Attribute 7 Level 4 Attribute 8 Level 1 Attribute 8 Level 2 Attribute 9 Level 1 Attribute 9 Level 2 Attribute 9 Level 3 Attribute 9 Level 4 Attribute 10 Level 1 Attribute 10 Level 2 Attribute 11 Level 1 Attribute 11 Level 2 Attribute 11 Level 3 Attribute 12 Level 1 Attribute 12 Level 2 Attribute 12 Level 3 Attribute 13 Level 1 Attribute 13 Level 2 Attribute 14 Level 1 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Priors Pairs 31.32* 48.11* 23.72* 35.21* 12.75* 16.72* 2.54 2.75 15.37* 4.26* 21.86* 8.12* 20.06* 7.93* 18.26* 9.62* 36.71* 55.90* 36.16* 47.97* 27.59* 37.77* 14.04* 19.10* 1.41* 0.16* Priors Pairs 59.06* 41.25* 14.19* 9.48* 56.04* 29.65* 6.39 4.88 31.00* 38.00* 52.18* 37.42* 54.23* 46.71* 1.65 0.82 57.89* 49.27* 2.19 0.48 55.78* 83.94* 62.48* 94.62* 1.79 0.70 43.98* 75.05* 69.89* 95.50* 3.45* 0.31* 71.34* 90.78* 5.03 2.88 81 n=123 (con’t) Study 2: Priors Pairs Attribute 7 Level 1 5.22* 17.37* Attribute 7 Level 2 27.77* 21.13* Attribute 7 Level 3 43.70* 19.85* * Significant difference at 95% confidence level. Priors Pairs Attribute 14 Level 2 59.20* 46.71* n=107 Study 3: Priors Attribute 1 “Brand” Level 1 45.11* Attribute 1 “Brand” Level 2 25.06* Attribute 1 “Brand” Level 3 46.68* Attribute 1 “Brand” Level 4 16.71* Attribute 2 Level 1 17.17* Attribute 2 Level 2 50.12* Attribute 3 Level 1 24.62 Attribute 3 Level 2 40.26* Attribute 4 Level 1 36.94 Attribute 4 Level 2 47.49* Attribute 4 Level 3 13.72 Attribute 5 Level 1 38.99 Attribute 5 Level 2 30.44* Attribute 5 Level 3 39.50* Attribute 5 Level 4 27.21* * Significant difference at 95% confidence level. 
Pairs 67.42* 43.28* 76.10* 30.92* 7.70* 35.22* 20.91 14.89* 40.48 20.59* 16.77 41.65 22.38* 26.30* 35.38* n=185 Study 4: Priors Attribute 1 “Brand” Level 1 36.53 Attribute 1 “Brand” Level 2 28.34* Attribute 1 “Brand” Level 3 51.03* Attribute 1 “Brand” Level 4 28.14* Attribute 2 Level 1 17.73* Attribute 2 Level 2 57.40 Attribute 3 Level 1 36.38* Attribute 3 Level 2 51.16* Attribute 3 Level 3 28.99* Attribute 4 Level 1 3.58* Attribute 4 Level 2 77.10* Attribute 5 Level 1 6.93* Attribute 5 Level 2 76.70 * Significant difference at 95% confidence level. Pairs 43.69 40.41* 88.62* 75.70* 5.12* 48.80 19.55* 27.15* 31.37* 6.54* 40.22* 1.50* 71.33 82 2001 Sawtooth Software Conference Proceedings: Sequim, WA. n=85 Study 5: Priors Pairs Attribute 1 Level 1 26.07* 7.65* Attribute 5 “Price” Level 6 Attribute 1 Level 2 28.89* 22.38* Attribute 5 “Price” Level 7 Attribute 2 Level 1 33.76 34.35 Attribute 2 Level 2 9.69* 1.14* Attribute 2 Level 3 32.82 29.90 Attribute 3 Level 1 21.94 18.74 Attribute 3 Level 2 20.24* 6.12* Attribute 3 Level 3 33.92* 19.99* Attribute 4 Level 1 24.29* 34.38* Attribute 4 Level 2 5.53* 1.45* Attribute 4 Level 3 36.78* 48.52* Attribute 4 Level 4 42.97* 68.34* Attribute 5 “Price” Level 1 9.58 12.00 Attribute 5 “Price” Level 2 22.52 18.68 Attribute 5 “Price” Level 3 32.60* 14.25* Attribute 5 “Price” Level 4 38.77* 24.13* Attribute 5 “Price” Level 5 37.27* 30.38* * Significant difference at 95% confidence level. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Priors Pairs 27.48* 42.83* 14.87* 64.76* 83 Appendix B Additional frequency distributions from the other studies, obviously this is aggregate level analysis and does not address the possibility of individual respondents being centroid oriented or outlier oriented. Also, different product categories could have an effect on the results and shape of the data. Frequency Distribution of Study 1—37 pairs x 333 respondents 1 - 15.4% 2 - 7.4% 3 - 12.7% 4 - 10.5% 5 - 14.5% 6 - 10.1% 7 - 11.5% 8 - 6.4% 9 - 11.5% Frequency Distribution of Study 2—35 pairs x 123 respondents 1 - 5.8% 2 - 8.9% 3 - 14.3% 4 - 12.7% 5 - 12.3% 6 - 13.2% 7 - 15.8% 8 - 8.5% 9 - 5.3% Frequency Distribution of Study 3—12 pairs x 107 respondents 1 - 4.2% 2 - 5.9% 3 - 13.1% 4 - 15.0% 5 - 16.3% 6 - 15.2% 7 - 16.9% 8 - 9.6% 9 - 3.6% Frequency Distribution of Study 4—8 pairs x 185 respondents 1 - 3.8% 2 - 8.4% 3 - 12.6% 4 - 14.7% 5 - 17.8% 6 - 15.7% 7 - 14.9% 8 - 7.0% 9 - 5.0% Frequency Distribution of Study 5—16 pairs x 85 respondents 1 - 19.2% 2 - 5.3% 3 - 7.8% 4 - 10.1% 5 - 13.7% 6 - 9.8% 9 - 19.0% Thanks To Nikkei Research Clients for data sets Chuck Neilsen Thunderbird American Graduate School of International Management Aaron Tam Don Milford Hiroyuki Iwamoto Teri Watanabe 84 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 7 - 9.6% 8 - 5.4% A METHODOLOGICAL STUDY TO COMPARE ACA WEB AND ACA WINDOWS INTERVIEWING Aaron Hill & Gary Baker Sawtooth Software, Inc. Tom Pilon TomPilon.com Over the past two decades, Adaptive Conjoint Analysis (ACA) has gained popularity and acceptance in the market research community. Due to the interactive nature of the questionnaire, ACA surveys must be conducted using computers. In the past, many researchers have relied on Sawtooth Software’s DOS-based ACA program. 
With Sawtooth Software’s introduction of two new ACA products (ACA/Web, released in December of 2000, and ACA version 5.0 for Windows, released November 2001), researchers can now conduct bimodal ACA studies, with respondents self-selecting into either a disk-by-mail (CAPI) or an online survey implementation. This development could allow researchers to conduct ACA surveys in a greater number of settings, reaching more respondents with more convenient survey tools and at lower costs. However, researchers conducting bimodal surveys must be confident that the results from the two survey methods are comparable, and that one method does not introduce error or bias not present in the other method. There is an inherent risk that two survey methods would differ in ways that would make combining and analyzing the resulting data sets unacceptable. While both of the new Sawtooth Software survey instruments were developed using the same underlying programming code, differences exist that could potentially introduce some bias. Theoretically, results for both methods should be almost exactly the same with respect to both implementation and results. This pilot research project suggests that any differences between the two survey methods are minimal and that it is acceptable to combine the results from the two modalities. BACKGROUND Adaptive Conjoint Analysis was first developed by Richard Johnson in 1985 as a way for researchers to study a larger number of attributes than was generally thought prudent with standard conjoint analysis methods. By collecting initial data from respondents, tradeoff tasks could be designed that efficiently collected information regarding the most relevant tradeoffs for each respondent. By combining a self-explicated priors section with customized tradeoff questions, ACA is able to capture a great deal of information in a short period of time. Over the years, Johnson has added new capabilities to his original process, leading to better methods for combining priors and pairs utilities. Additionally, Johnson’s recent application of hierarchical Bayes estimation for ACA has improved the predictive accuracy of ACA utilities even further. The ACA conjoint model has steadily gained favor over the years due to its effectiveness in researching products and services with large numbers of attributes or with high customer involvement. Additionally, respondents in one survey indicated that ACA studies seem to take less time and are more enjoyable than comparable full profile conjoint studies (Huber et al., 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 85 1991). ACA is now “…clearly, the most used commercial conjoint model…” (Green et al., 2000). ACA has other important characteristics as well. The individual-level utility calculations allow researchers to segment respondent populations and generally draw more precise conclusions about market preferences than conjoint methods based on aggregate utility estimation. This is particularly true in cases where a high degree of heterogeneity exists. ACA also tends to be less prone to the “number of levels effect,” where attributes represented on more levels tend to be biased upward in importance relative to attributes represented on fewer levels. Finally, ACA is generally robust in the face of level prohibitions within the paired tradeoff section and significantly reduces troublesome utility reversals (e.g. a higher price having a higher utility than a lower price) relative to traditional conjoint methods. 
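As an aside, the utility-reversal check mentioned above is easy to automate at the individual level. The sketch below counts respondents whose price part-worths ever increase as price increases; the data layout and names are illustrative and are not part of any Sawtooth Software product.

def count_price_reversals(price_utils_by_respondent):
    # Each inner list holds one respondent's price part-worths ordered from
    # lowest price to highest price (an assumed layout for illustration).
    reversed_count = 0
    for utils in price_utils_by_respondent:
        # A reversal exists if utility ever increases as price increases.
        if any(later > earlier for earlier, later in zip(utils, utils[1:])):
            reversed_count += 1
    return reversed_count

sample = [[0.9, 0.5, 0.1], [0.4, 0.6, 0.2]]   # the second respondent shows a reversal
print(count_price_reversals(sample))           # -> 1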
ACA is not necessarily the best conjoint method for all situations, however. The interview format forces respondents to spread their attention over many attributes, resulting in a tendency to “flatten out” relative importance scores when many attributes are studied. ACA interviews also require respondents to keep an “all else being equal” mindset when viewing only a subset of the full array of attributes. Both of these factors tend to reduce ACA’s effectiveness in measuring price attributes, often leading to substantially lower importance (price elasticity) estimates than are actually present in the market. ACA surveys also must be administered using computers, which rules out its use in some interviewing situations. It has no compelling advantages over other conjoint methods when smaller numbers of attributes (about six or seven or fewer) are being studied. SURVEY EXPERIMENTAL DESIGN Objective The primary purpose of our study was to validate the hypothesis that there are no substantial differences between the Web and Windows versions of ACA that would make utilities from one incompatible with utilities from the other. That is, all other things held equal, the method used to implement an ACA survey (either Web or disk) should have no impact on the final utilities or results. This hypothesis is based on several underlying assumptions. The first assumption is that the design algorithm used to create each individual survey is fundamentally the same between methods. Secondly, any differences in survey look and feel can be minimized. Finally, the hypothesis assumes that there is minimal “survey modality” self-selection bias. Our secondary reason for conducting this research was to see how respondents reacted to the different methods. We were particularly interested in finding whether perceptions differed with regard to how long the surveys took, whether respondents felt that they were able to express their opinions, and whether one was more enjoyable or more realistic than the other. Design Considerations To test the primary hypothesis, Web and Windows versions of the same ACA survey were developed and administered to 121 students at three universities. Each respondent was randomly assigned to take one version or the other, so that the numbers of respondents for each mode were roughly equal at each university. The first version (“Web”) was created and administered using 86 2001 Sawtooth Software Conference Proceedings: Sequim, WA. the ACA/Web program, with respondents gaining access to the survey over the Internet through a password-protected site. The second version (“Windows”) was created using ACA/Windows and administered, wrapped within a Ci3 questionnaire, using a CAPI approach. Students completed the survey on PCs and returned the disks to their instructor. The survey asked about features of notebook computers, with utilities measured for two to six levels for each of ten attributes (Appendix 1). Seven of the ten attributes were a priori ordered, while the remaining three had no natural order of preference. Notebook computers were chosen because we believed that the majority of the respondents would be familiar enough with the subject to complete the survey and because notebook computers involve a large number of attributes that could affect purchase decisions. To control for the underlying assumptions, consideration was given to several survey design and implementation factors. 
First, both versions of the questionnaire were evaluated and tested prior to launch to ensure that they generated logical surveys and reasonable utilities. After collecting the field data, the joint frequency distributions between levels were measured to confirm that each attribute/level combination occurred with roughly equal frequency. One unexpected benefit of this testing was that we found, and were subsequently able to correct, a small error in the way that the Web software generated the tradeoffs in the pairs section. Incorrect random number seeding led to similarity in the initial pairs questions for different respondents, which led to undesirable level repetition patterns across (but not within) respondents. Fortunately, there is no evidence that this problem had any practical effect. We also went to great lengths to make both surveys look and feel similar. Wherever possible, question and instruction text was written using exactly the same wording, and the font and color schemes were standardized across interfaces. This ensured that we minimized the differences that were not interface-specific. Self-selection bias was controlled by randomly assigning participants to one version or the other, a practice that would not normally occur in real world research situations. In situations where researchers are interested in performing bimodal surveys such as this, it is highly recommended that they take appropriate steps to ensure that the two surveys are as similar as possible. Even seemingly minor differences can lead to negative consequences, as is the case if rating scales are dissimilar, if instruction text is slightly different, or if survey design is not exactly the same (i.e. one survey specifies an a priori ordered attribute where the other does not). SURVEY VALIDATION Holdout Tasks Results from the study were validated through the use of four partial-profile holdout choice tasks. Three product concepts described using three of the ten attributes were displayed per choice task. Across the four choice tasks, all attributes were displayed. The four holdout choice tasks were repeated (with concept order rotated within each task) to measure test/retest reliability (see Appendix 4). Holdouts were designed based on initial utility estimates derived from a small test study. The first task was designed to be relatively easy (predictable), with one choice dominating the others. Two of the remaining three tasks were designed with varying degrees of choice difficulty 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 87 (predictability), while the fourth task had three choices with roughly equal average utilities, making the choice task more difficult for respondents to answer (and for us to predict). It is important for researchers to incorporate holdout tasks in surveys. Well-constructed holdout tasks are extremely useful for a variety of reasons. In our study, these holdouts were used to control for differences in respondent test/retest reliability between the two design treatments, determine the correct scaling “exponent” in the market simulator, measure the accuracy of the various utility estimation techniques, and to identify inconsistent respondents. 
Researchers can also use holdout tasks to test specific product concepts of interest to clients, provide concrete validation of the conjoint analysis process, identify potential errors in survey design or data processing, and to determine which utility estimation method results in the most accurate predictions for a specific data set. (Note: For additional information on designing holdout tasks, please refer to Appendix 5.) Method Comparison We began our analysis by computing utilities for the Web and Windows respondent groups using several different estimation techniques (including priors only, OLS regression and Hierarchical Bayes). This allowed us to examine whether different techniques might identify incongruencies between the two survey methods. Again, our hypothesis was that the results from the two survey methods should be the same regardless of analysis technique. Using these sets of utilities and the results of the holdout tasks, we calculated “hit rates” for each respondent. These hit rates measured how often an individual’s utilities were able to successfully predict his or her actual responses to the holdouts. The hit rates were then compared to each individual’s test/retest reliability, a measure of how often respondents were able to give the same answer to a repeated holdout task. We adjusted the hit rates to account for the internal consistency of the respondents within each group by dividing each respondent’s actual hit rate by his or her retest reliability score. This “adjusted hit rate” represents the ratio of the number of times a choice was accurately predicted by the utilities to the number of times a respondent was able to accurately repeat his or her previous choices. Since our choice tasks consisted of three alternatives, we would expect random responses to achieve a mean test/retest reliability of 33%. To validate the results of the survey, we first compared the adjusted hit rates within each utility calculation method to determine how well the different survey methods were able to predict the individual holdout choices. We then performed significance testing (t-test) to determine if small amounts of hit rate variation between the two methods were statistically significant for any of the utility calculation methods. A lack of statistically significant differences would seem to support our hypothesis that the Web and Windows versions would produce comparable results. We also wanted to see whether the two versions produced similar results when used in the market simulator. In other words, would the difference between predicted and actual shares of choice for the holdouts be similar between survey methods? To determine whether the two methods produced similar results, we computed the mean absolute error (MAE) between the aggregate share of preference reported by the market simulator and the actual choice probability for each alternative represented in the holdout tasks. By averaging the actual holdout choices for 88 2001 Sawtooth Software Conference Proceedings: Sequim, WA. both groups, comparisons between groups can be made using a standard benchmark. Using the market simulator, we computed shares of preference for each holdout task and tuned the “exponent” to minimize the mean absolute error between the estimated share of preference and the average holdout probabilities. It is critical to tune the exponent to minimize the MAE so that the true predictive ability of the utilities can be assessed independent of the differences caused by utility scaling. 
FINDINGS AND CONCLUSIONS

It should be noted that this project is only intended as a pilot study. Further research would be useful to confirm our results, particularly since the sample size was relatively small and the target population was less than representative of typical market research subjects.

Finding #1: There were no significant differences in how the two groups reacted to the survey process.

When asked about the survey experience, respondents rated the two versions of the survey as essentially equivalent. There were no significant differences in ratings on the items reported in Table 1, and open-ended responses were very similar between the two versions.

Qualitative Measures

Respondents were asked to rate their respective surveys on seven descriptive characteristics. No significant differences between the two methods were found, and the ratings were comparable to those from a previous study involving ACA for DOS (Huber et al., 1991).

Table 1: Qualitative ratings of survey methods

"How much do you agree or disagree that this survey:"     Web   Windows   '91 Huber Study   T-test*
"…was enjoyable"                                           5.8     6.0          5.8            -0.6
"…was easy"                                                6.7     6.6          6.6             0.1
"…was realistic"                                           6.8     6.6          6.3             0.7
"…allowed me to express my opinions"                       6.4     6.3          6.4             0.2
"…took too long"                                           4.4     4.0          3.9             0.8
"…was frustrating"                                         3.2     3.2          3.3             0.0
"…made me feel like clicking answers just to get done"     3.6     3.9          3.4            -0.6

* T-test compares results between the Web and Windows versions.

Time to Complete

The perceived time to complete the survey was significantly lower than the actual time for both versions, a finding that has been illustrated in previous studies. One potential drawback of a Web interface is that slow connections can significantly increase the amount of time it takes to complete a survey. In this study, there was not a significant difference in the survey completion times between the two methods (both took about 18 minutes) (see Table 2).

Table 2: Time to complete interview

Time Differences                        Web    Windows   T-test
Perceived time to complete survey      15.2       13.7      1.7
Actual time to complete survey         17.5       18.1     -0.3

Finding #2: There did not appear to be a significant difference between the Web utilities and the Windows utilities.

Utilities were calculated using the standard Ordinary Least Squares (OLS) estimation procedure included with ACA, and then normalized as "zero-centered diffs." Significance testing of these utilities revealed that there were no statistically significant differences (at the 95% confidence level) between the importance scores for the two methods (Appendix 2). Although some of the average utility values for the two methods appear to differ (Appendix 3), such differences are few in number (6 out of 36 with t>2.0) and their true significance is not easily assessed because of their lack of independence. More telling, the correlation between the resulting average utilities was 0.99. This leads us to conclude that for all practical purposes there seems to be no difference between the utilities estimated by the two methods.

Finding #3: Priors ("self-explicated") utility values did as well as OLS utilities with respect to hit rates for this data set.
One of the more interesting findings in this study was that the self-explicated section of ACA appeared to do just as well at predicting holdouts (in terms of hit rates) as the ACA Ordinary Least Squares utilities, which incorporated the additional information gathered in the pairs section. Normally, one would expect that the additional pairs information would increase the likelihood of accurately predicting holdout tasks. However, in this case, it appears that the pairs section did not improve the hit rates achieved using the OLS utilities. It should also be noted that utilities estimated using an ACA/Hierarchical Bayes run that incorporated just the pairs information (not the default HB method) achieved adjusted hit rates that were very similar to the self-explicated results. This suggests that each section alone contributes approximately equal information, while the addition of the other section marginally adds to the predictive ability of the model.

Finding #4: For both groups, Hierarchical Bayes estimation achieved significantly higher adjusted hit rates.

Using ACA/HB, we used the pairs data to estimate the utilities and constrained the estimates to conform to the ordinal relationships determined in the priors section. With both data sets, the Hierarchical Bayes utilities outperformed both the OLS utilities and the priors-only utilities, achieving an adjusted hit rate of 81% for the Web and 91% for Windows (see Table 3).

Table 3: Adjusted Hit Rates for Different Utility Estimation Techniques

                                  Web     Win     t-test
Test-retest reliability          84.5%   86.1%    -0.5
Adjusted hit rates
  ACA OLS analysis               72.8%   82.7%    -1.9
  Prior utilities only           76.0%   85.0%    -1.5
  HB (constrained pairs)         80.9%   90.6%    -1.5
  HB (unconstrained pairs)       74.9%   84.1%    -1.5

Finding #5: This study (and many others) suggests that well-designed self-explicated surveys may perform quite well in some situations.

In the past several years, researchers have debated the value of asking paired tradeoff questions in ACA when seemingly good results can often be obtained using a well-designed self-explicated questionnaire. This study seems to lend some credibility to the use of self-explicated models, but this finding comes with several important caveats. First, past comparative studies between the two models have generally compared ACA with well-developed self-explicated questionnaires, which are more suited for stand-alone use than the simple question flow used in the priors section of ACA. For instance, ACA allows researchers to specify a priori attribute orders, which assume linearity in the priors. In a self-explicated survey, researchers would want to use a rating-type scale to estimate the differences in preference between these levels. As a result, one cannot necessarily conclude that a study consisting only of ACA priors is a reasonable substitute for a well-designed self-explicated exercise. Additionally, researchers cannot know until after the fact whether a "self-explicated only" survey would have produced acceptable results. It is often only after the ACA results are analyzed that the appropriate survey mechanism can be determined and the value of the pairs assessed. While the pairs questions in our survey may have contributed only marginally to the results achieved through OLS estimation, they were essential in achieving the superior results obtained through HB analysis.
This leads us to conclude that if using ACA, one should always include the paired tradeoff section unless budgetary or time constraints prevent their use. In these cases, we would recommend the use of a more rigorous form of self-explicated questionnaire than is available in the priors section of ACA. Finding #6: Now that we have Hierarchical Bayes estimation for ACA, the inclusion of the Calibration Concepts section may be a wasted effort in many ACA studies. Another significant finding confirms past observations regarding the value of the calibration concepts section of ACA. Calibration concepts are used to scale individual utilities for use in purchase likelihood simulation models. They are also used to identify inconsistent respondents. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 91 They have been an integral part of ACA since its inception, but are often regarded by respondents as being confusing. In the market simulation validation described previously, the uncalibrated utilities had lower mean absolute errors than the calibrated utilities (See Table 4). Additionally, the calibrated utilities required significantly large exponents to minimize the error in share of preference simulations. In most cases where calibrated utilities are used in market simulations, our experience indicates that the exponent (scale factor) will need to be set at around five to maximize the fit of holdout choice tasks. With the introduction of ACA/Hierarchical Bayes, calibration concepts will be needed in fewer ACA studies. Hierarchical Bayes analysis automatically scales the utilities for use in most market simulation models, although additional tuning of the exponent may still be required. It also calculates a goodness of fit measure that is probably a better estimate of respondent consistency than ACA’s correlation figure, which is based on the calibration concepts. Unless the researcher requires the purchase likelihood model, there is no need to ask the calibration concept questions. Finding #7: The Windows and Web ACA groups did not differ either in terms of the holdout “hit rates” or the predictive accuracy of holdout shares. The final finding, and probably the most important, is that we did not find any significant differences in the final results between the two versions. Both versions had very similar adjusted hit rates within utility estimation treatments. In each case, adjusted holdout hit rates improved when Hierarchical Bayes utility estimation was used (see Table 3). We also examined the mean absolute errors (MAE) of each method when comparing market simulation results with actual choice probabilities from the holdout tasks. While the mean absolute errors varied between utility estimation methods, the MAEs of each survey method were very similar for each estimation technique (See Table 4). Once again, we found that the Hierarchical Bayes utility estimates provided improvements over other utility estimation methods, achieving the lowest MAEs. Table 4: Mean Absolute Errors between Market Simulation Share of Preference and Observed Choices from Holdout Tasks ACA Utility Run (OLS) Prior Utilities HB Utilities HB Utilities (Calibrated) 92 Web Windows MAE 7.15 7.06 5.86 6.21 MAE 6.60 7.14 6.45 6.51 2001 Sawtooth Software Conference Proceedings: Sequim, WA. CONCLUSIONS Our study suggests that dual mode surveys, using both ACA/Web and ACA for Windows, are an acceptable option for conducting research. 
The results from the two seem not to differ substantially, meaning that you should be able to run dual mode studies and combine the final utility data. Secondary findings seem to mirror previous research with respect to the value of selfexplicated utilities, and the additional predictive ability added when using HB. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 93 Appendix 1: Attributes & Levels Included in Notebook Survey Brand Weight Screen Size (a priori – best to worst) Battery Processor (a priori – worst to best) (a priori – worst to best) Weighs 6.3 pounds Weighs 7.0 pounds Weighs 7.7 pounds Weighs 8.4 pounds 15.7" Screen Size 15.0" Screen Size 14.0" Screen Size 12.0" Screen Size 10.4" Screen Size 2 Hour Battery Life 3 Hour Battery Life 4 Hour Battery Life 600 Mhz Processor Speed 700 Mhz Processor Speed 800 Mhz Processor Speed Hard Drive Memory Pointing Device Warranty Support (a priori – worst to best) (a priori – worst to best) (a priori – worst to best) (a priori – best to worst) 10 GB Hard Drive 20 GB Hard Drive 30 GB Hard Drive 1 yr. Parts & Labor Warranty 2 yr. Parts & Labor Warranty 3 yr. Parts & Labor Warranty 24 hrs/day 7 days/week Toll Free Telephone Support 13 hrs/day 6 days/week Toll Free Telephone Support IBM Compaq Toshiba Dell Sony Winbook 64 MB RAM 128 MB RAM 192 MB RAM 256 MB RAM Touchpad Pointing Device Eraser Head Pointing Device Touchpad & Eraser Head Pointing Device Appendix 2: Importance Scores Derived from Web and Windows Utilities Importance RAM Screen Brand Battery Weight Speed Hard Drive Warranty Pointing Device Support 94 Web Windows 13.2% 12.7% 13.2% 12.9% 11.3% 12.6% 10.4% 9.3% 10.2% 9.7% 9.9% 8.5% 9.7% 8.8% 9.2% 10.1% 7.2% 8.3% 5.7% 7.1% t-stat 0.81 0.38 -1.80 1.64 0.57 1.87 1.34 -1.09 -1.33 -1.83 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Appendix 3: Average Utility Scores for Web and Windows Versions of ACA (Represented as “zero-centered diffs”) Web Win t-stat Brand IBM Compaq Toshiba Dell Sony Winbook 19.9 -1.3 -6.6 14.8 13.3 -40.1 22.0 2.1 -4.8 12.2 21.5 -53.1 -0.3 -0.5 -0.3 0.3 -1.1 2.3 Weight 6.3 lbs 7 lbs 7.7 lbs 8.4 lbs 43.5 23.0 -16.9 -49.5 41.6 11.3 -8.5 -44.3 0.4 3.2 -2.3 -1.0 Screen Size 15.7" screen 15" screen 14" screen 12" screen 10.4" screen 35.9 38.8 13.7 -28.3 -60.1 46.2 33.5 14.2 -32.5 -61.4 -1.4 1.0 -0.1 0.7 0.2 Battery Life 2 hr. battery 3 hr. battery 4 hr. battery -52.5 1.8 50.8 -43.7 -2.0 45.7 -2.2 1.2 1.2 Processor Speed 600 Mhz processor 700 Mhz processor 800 Mhz processor -48.4 -1.2 49.5 -40.5 -0.1 40.6 -2.0 -0.3 2.1 Web Win t-stat -49.6 4.3 45.4 -43.6 3.3 40.3 -1.5 0.3 1.3 Random Access Memory (RAM) 64 MB RAM -67.3 128 MB RAM -17.4 192 MB RAM 22.4 256 MB RAM 62.3 -64.1 -15.2 22.4 56.9 -0.7 -0.6 0.0 1.3 Pointing Device Touchpad Eraser head Both -8.0 -10.4 18.5 -1.4 -18.1 19.5 -1.0 1.1 -0.2 Warranty 1 year warranty 2 year warranty 3 year warranty -43.1 -3.9 47.0 -49.5 1.6 47.9 1.3 -1.6 -0.2 Technical Support 24/7 tech support 13/6 tech support 27.7 -27.7 34.7 -34.7 -1.7 1.7 Hard Drive Capacity 10 GB hard drive 20 GB hard drive 30 GB hard drive 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 95 Appendix 4: Holdout Choice Task Design * Holdout 5 = Holdout 1 (shifted); Holdout 6 = Holdout 2 (shifted); etc. Holdout 1 Holdout 2 Holdout 3 Holdout 4 Holdout 5* Holdout 6* Holdout 7* Holdout 8* Concept A Toshiba 15" Screen 10 GB Hard Drive IBM 600 Mhz Processor 128 MB RAM 2 hour battery Touhpad & Eraserhead 3 year warranty Weighs 6.3 lbs. 
10 GB Hard Drive 13/6 tech support Winbook 12" Screen 20 GB Hard Drive Dell 800 Mhz Processor 64 MB RAM 4 hour battery Touchpad 1 year warranty Weighs 7.7 lbs. 20 GB Hard Drive 24/7 tech support Concept B Sony 14" Screen 30 GB Hard Drive Compaq 700 Mhz Processor 256 MB RAM 3 hour battery Eraserhead 2 year warranty Weighs 8.4 lbs. 30 GB Hard Drive 13/6 tech support Toshiba 15" Screen 10 GB Hard Drive IBM 600 Mhz Processor 128 MB RAM 2 hour battery Touhpad & Eraserhead 3 year warranty Weighs 6.3 lbs. 10 GB Hard Drive 13/6 tech support Concept C Winbook 12" Screen 20 GB Hard Drive Dell 800 Mhz Processor 64 MB RAM 4 hour battery Touchpad 1 year warranty Weighs 7.7 lbs. 20 GB Hard Drive 24/7 tech support Sony 14" Screen 30 GB Hard Drive Compaq 700 Mhz Processor 256 MB RAM 3 hour battery Eraserhead 2 year warranty Weighs 8.4 lbs. 30 GB Hard Drive 13/6 tech support Appendix 5: Guidelines for Creating Effective Holdout Tasks For years, Sawtooth Software has recommended that researchers include holdout tasks in conjoint studies. Holdout tasks, when designed appropriately, can provide invaluable information as researchers implement studies and analyze conjoint results. Holdout tasks provide an effective tool for: 96 identifying/removing unreliable respondents measuring respondent test/retest reliability checking that split samples indeed reflect a similar composition of reliable/unreliable respondents allowing researchers to adjust the samples’ reliability (by weighting respondents or by indexing hit rates to within-sample test-retest reliability) comparing the predictive ability of different utility estimation models (ACA Windows vs. Web, HB versus OLS, etc.) testing specific product concepts of particular interest identifying potential errors in conjoint design or data processing tuning market simulation models 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Designing Holdout Tasks For methodological research, we suggest that you include four or more holdouts, repeated at some point in the interview. For commercial research, two or three holdout tasks may be adequate. Below are some suggestions for designing effective holdout tasks. It’s a good idea to pretest your conjoint survey before fielding your questionnaire. In addition to identifying problem areas, these pretests can also be used to gather preliminary utility estimates. These preliminary estimates are useful in designing well-balanced holdout choice tasks. Holdout choice tasks should be constructed with the goal of providing the maximum amount of useful information. Use preliminary utility estimates to design holdout tasks based on the following principles: ♦ If the choice task is very obvious (nearly everybody chooses same concept), the holdout task is less valuable because the responses are so predictable. ♦ If the choice task is too well balanced (nearly equal probabilities of choice for concepts), the holdout task is less valuable because even random utilities will predict preference shares well. ♦ The best strategy is to have multiple holdouts with varying degrees of utility balance. This allows you to test extreme conditions (very easy and very difficult), and provides a richer source by which to test the success of various utility estimation techniques, simulation models (including tuning) or experimental treatments. 
♦ Choice tasks should include a variety of degrees of product differentiation/similarity to ensure that the simulation technique can adequately account for appropriate degrees of competitive or substitution effects. In this study, we designed holdouts to include relatively dominated to relatively balanced sets. The table below shows the actual choice probabilities for the sample (n=121). Choice probabilities (test and retest collapsed)* Holdout1 Holdout 2 Holdout3 Holdout4 Alternative A 13% 18% 31% 18% Alternative B 85% 72% 38% 25% Alternative C 2% 10% 31% 57% * Actual respondents chose the same concept 85% of the time between test and retest (Windows group: 86%; Web group: 84%) 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 97 REFERENCES Huber, Joel C., Dick R. Wittink, John A. Fiedler, and Richard L. Miller (1991), “An Empirical Comparison of ACA and Full Profile Judgments”, in Sawtooth Software Conference Proceedings, Sun Valley, ID: Sawtooth Software, Inc., April, 189-202. Green, Paul E. (2000), “Thirty Years of Conjoint Analysis: Reflections and Prospects”, forthcoming in Interfaces. 98 2001 Sawtooth Software Conference Proceedings: Sequim, WA. INCREASING THE VALUE OF CHOICE-BASED CONJOINT WITH “BUILD YOUR OWN” CONFIGURATION QUESTIONS David Bakken, Ph.D. Senior Vice President, Harris Interactive Len Bayer Executive Vice President and Chief Scientist Harris Interactive ABSTRACT We describe a method for capturing individual customer preferences for product features using a “build your own product” question. Results from this question are compared to choicebased conjoint results using the same attributes and levels and gathered from the same individuals. BACKGROUND AND INTRODUCTION Choice-based conjoint analysis has become one of the more popular market research techniques for estimating the value or utility that consumers place on different product features or attributes. Choice-based conjoint is particularly valuable when the goal of the research is the design of an optimal product or service, where “optimal” is usually defined in terms of customer preferences, but also may be defined in terms of projected revenues or projected profits. Choice simulators are used to identify these optimum product configurations. In most cases, the goal is to find the one “best” level of each feature or attribute to include in the product. While it is technically possible to simulate the impact of offering multiple products from the same manufacturer with different options, with more than a few features and levels, the task becomes daunting. Moreover, under conditions in which IIA applies, the simulations may give a misleading picture of the preference shares for the overall product line. As a result, choice-based conjoint models are not especially well-suited for products that involve “mass customization.” For manufacturers, mass customization offers the prospect of satisfying diverse customer needs while reducing the costs associated with greater product variety. However, from the standpoint of profitability, many companies give customers far too many choices. Pine et al (1993) cite as an example the fact that Nissan reportedly offered 87 different types of steering wheels. While choice-based conjoint potentially can identify the relative value of each different type of steering wheel, this technique is less suited for identifying the best mix of different wheel types to offer. 
Choice-based conjoint faces another problem in products where the levels of the various attributes or features have significant cost and pricing implications. While one might calculate a separate price for every concept in the design based on the specific attribute levels, this creates a few problems. In the first place, price is no longer uncorrelated with utilities of the different 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 99 attribute levels. The negative correlation of utilities between price and features may result in model failure, as Johnson, et al (1989) has demonstrated. Moreover, if the incremental cost for a given feature level is the same across all concepts, it will be impossible to estimate the price sensitivity for that level. Conditional pricing and alternative-specific designs offer limited capability for incorporating feature-specific pricing. For some types of products (e.g., automobiles, personal computers), the diversity of consumer preferences has led most manufacturers to offer a wide variety of feature options. Many of these companies now offer “design your own product” functions on their websites. These sites allow customers to configure products by selecting the options they wish to include. While we know of no instance where companies are using these sites as a market research tool, the format of these design-your-own features is well suited to Internet survey administration. Taking the “design your own product” website functions as a model, we developed an application to enable us to insert a design your own product question into a web-survey. STUDY OBJECTIVES The objective of the research described in this paper was to determine the value of combining a design your own product question with a choice-based conjoint exercise. We sought to answer these questions: • • • How do the results from a design your own question compare to preference shares estimated from a discrete choice model? Can design your own questions be used to validate results from choice-based conjoint? Can design your own questions be used in place of choice experiments? In answering these questions, we hoped to determine if the design your own results might be used to create individual level “holdout” tasks, particularly for partial profile designs. We also hoped to investigate the possibility of estimating utilities from design your own questions. IMPLEMENTATION OF THE BUILD YOUR OWN QUESTION After a respondent completes the choice experiment, a follow-up question presents the same attributes, one at a time, and asks for the most preferred level. Each level has a price attached (which may or may not be revealed), and the screen displays a “total price” based on the specific features selected so far. The respondent can move back and forth between the attributes until he or she arrives at a configuration and price that is acceptable. Figure 1 presents an example of a build your own screen. 100 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Figure 1 At the top of the screen we display the total price for the product as configured. Below this, the screen is divided into four sections. The upper left section lists the different features. Highlighting a feature will display additional information about that feature in the upper right section. At the same time, the available levels of the feature will be displayed in the lower left section. Highlighting a level will display any descriptive information (including graphics) about the level in the lower right section. 
The respondent selects a feature and level, and then moves to another feature. Any feature specific changes in price are reflected in the price displayed at the top. When the respondent is satisfied with the levels for each feature and the total price, clicking on the “finish” button will submit the data to the survey engine. Respondents must select a level for each feature. CASE STUDIES We conducted two different studies incorporating the build your own question with a discrete choice experiment. Both studies involved consumer products. The data were collected through Internet surveys of Harris Poll Online (HPOL) panelists. First Study: Consumer Electronics Product The first study concerned a new consumer electronics product. The overall objective of the study was to determine the optimum configuration for the product. A secondary objective was to determine the extent to which specific optional features could command a higher price. The product was defined by nine attributes. Eight of the attributes consisted of three levels each. The ninth was a six-level attribute that resulted from combining two three-level attributes in order to eliminate improbable (from an engineering standpoint) combinations from the design. An 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 101 orthogonal and balanced fractional factorial design was created using a design table. The design incorporated conditional pricing based on one of the features. A total of four price points were employed in the study, but only the lower three price points appeared at one level of the conditional attribute, and the upper three price points appeared with the other levels of this attribute. The build your own question1 followed the choice experiment in the survey. There are two important differences between the choice experiment and the build your own question. First, price is not included as an attribute in the build your own question. Instead, a price is calculated based on the feature levels the respondent selects. Second, in this study, the “combined” attribute was changed to the separate three-level attributes in the build your own question in order to determine the extent to which consumers might prefer one of the improbable combinations. A total of 967 respondents from the HPOL online panel completed the survey. A choice model was estimated using Sawtooth Software’s CBC-HB (main effects only). Study One Results Before presenting the results from the first study, we feel it is important to point out some of the challenges that arise in comparing build your own product data with market simulations using a discrete choice model. In our case, there is no “none of the above” option in the build your own question. Respondents are forced to configure a product, selecting one level for each feature. Another difficulty lies in defining the competitive set for the market simulator. In theory, the choice set for the build your own question consists of all possible combinations of attribute levels. Additionally, many attribute levels have a specific incremental price in the build your own question. This pricing must be accommodated to avoid drawing incorrect conclusions from the comparison. In this study, this was accomplished through conditional pricing in the choice experiment for those attributes with large price differences between levels.2 With these limitations in mind, we attempted to find comparable results wherever possible. 
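As a concrete illustration of the build your own mechanism described earlier (one level required per feature, with a running total equal to a base price plus level-specific increments), here is a minimal sketch. The feature names, levels, and prices are hypothetical; a real implementation such as the authors' web application would add the display and navigation logic.

```python
# Minimal sketch of the "build your own" pricing logic: one level must be chosen
# per feature, and the displayed total is the base price plus level increments.
# Features, levels, and prices are hypothetical.
BASE_PRICE = 199.0
FEATURE_PRICES = {
    "processor": {"standard": 0.0, "fast": 100.0, "fastest": 250.0},
    "memory":    {"64 MB": 0.0, "128 MB": 50.0, "256 MB": 125.0},
    "warranty":  {"1 year": 0.0, "3 years": 60.0},
}

def total_price(selections: dict) -> float:
    """Return the configured price; require exactly one valid level per feature."""
    if set(selections) != set(FEATURE_PRICES):
        raise ValueError("a level must be selected for every feature")
    return BASE_PRICE + sum(
        FEATURE_PRICES[feature][level] for feature, level in selections.items()
    )

# Example: the respondent's current configuration, updated as levels are clicked.
config = {"processor": "fast", "memory": "256 MB", "warranty": "1 year"}
print(total_price(config))   # 199 + 100 + 125 + 0 = 424.0
```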
We first compared the incremental differences in the proportion of respondents choosing each level of each feature in the build your own question with the incremental changes in preference share from the market simulator. The results for four representative attributes are shown in Figure 2. Here we see that the differences in preference for each level of each attribute are in the same direction although the magnitude of the change varies. 1 This study employed an earlier version of the build your own question than that illustrated in Figure 1. In this version, each feature was displayed in a table with the relevant levels. Radio buttons controlled the selection of the feature levels. 2 For these attributes, the utility estimates encompass information about sensitivity to the price increment for different levels of each attribute. Thus, the market simulations should reflect the incremental pricing for each attribute level without specifically incrementing price, as long as the overall price is in the right range. 102 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Figure 2 19% Attr05/L3 28% 31% 31% Attr05/L2 35% Attr04/L3 46% 43% Attr04/L2 12% Attr02/L3 Attr02/L2 9% Build your own Choice Tasks 24% 13% Attr01/L3 63% 43% 18% 19% Attr01/L2 0% 44% 10% 20% 30% 40% 50% 60% 70% % of respondents We next explored the extent to which the “ideal” products produced by each method resulted in similar preference shares. To accomplish this, two market simulations were conducted. In the first simulation, three products were constructed so that one product was comprised of the feature levels with the highest average utility, a second was comprised of all feature levels with the lowest average utility values, and the third was comprised of the levels with intermediate average utility3. The most preferred feature level was the same for five of the eight available attributes. The second most preferred level was the same across the two methods for only three attributes. The least preferred level was the same for four attributes. Any differences in preference share will be due to these mismatches between attribute levels, and such mismatches will always reduce the preference share for the ideal build your own product relative to the ideal product based on average utility estimates. As Figure 3 illustrates, the simulated preference share for the ideal build your own product is 77%, compared to 90% for an ideal product based on average utility estimates. 3 Because price was not an attribute in the build your own question but was, in fact, correlated with the attribute levels, we set the price of the most desired product at the highest level. The mid-product was priced at the intermediate level, and the least desired product at the lowest level of the price attribute. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 103 Figure 3 76.8% Most Preferred 90.0% 10.2% Second Preferred Build Your Own Choice Model 7.6% 7.9% Least Preferred 0.0% 0% 20% 40% 60% 80% 100% Share of Preference Each task in the choice experiment represents a subset of the possible combinations of features and levels. In the build your own question, respondents have the ability of “choosing” any of the 39 possible product configurations. Figure 4 shows the proportion of respondents building products with the best (most often picked) levels of all nine attributes, and of eight, seven and six attributes. Only a handful of respondents configured a product with the best level of all nine attributes. 
Given that this results in a product with the highest possible price, this is not surprising. Slightly more individuals (2.6%) configured a product with eight of the nine best levels. The average price across these configured products was $370, about 11% lower than that of the overall best product. Greater numbers designed products with best levels for seven out of nine (13.9%) and six out of nine (25.6%) attributes. Interestingly, the average price for these products is similar to that for the eight of nine attribute combinations. Figure 4 9 a t t ribut e s 0 .2 % Avg. price=$410 2 .6 % 8 a t t ribut e s Avg. price=$370 7 a t t ribut e s 13 .9 % Avg. price=$370 Avg. price=$376 6 a t t ribut e s 0% 5% 10 % 15 % 20% 2 5 .6 % 25% % of respondents 104 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 30% We compared choices in the build your own question with attribute and level-specific individual utility estimates from the choice model. For each respondent, we identified the most preferred level of a feature based on his or her utility estimates and compared that to the level that was selected in the build your own question. Figures 5 and 6 show the results for two individual attributes. Similar results were obtained for other attributes. One important characteristic of this particular attribute is the large increment ($100) in price between levels in the build your own question. As noted previously, conditional pricing was used in the choice experiment to reflect this increment. However, it appears that the conditional pricing may not capture the effect of this price increment. For example, only 23% of those who gave the highest utility to level three actually selected this level in the build your own question. Figure 5 % "Correct" for First Attribute 52.7% Level 3 Preferred 35% 12.1% Picked Level 3 71.3% Level 2 Preferred 23% Picked Level 2 5.4% Picked Level 1 65.5% Level 1 Preferred 26% 8.6% 0% 20% 40% 60% 80% Figure 6 % "Correct" for Second Attribute 22.6% Level 3 Preferred 34.5% 42.9% 52.5% Level 2 Preferred 33.5% Picked Level 3 Picked Level 2 13.9% Picked Level 1 28.7% Level 1 Preferred 46.0% 25.3% 0% 10% 20% 30% 40% 50% 60% 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 105 Build your own results may aid in the design of an optimal product line consisting of more than one variant based on different feature combinations. As a simple example, we looked at three attributes in combination. Here our goal is to identify two or more different products that will maximize overall penetration. Figure 7 shows the ranking of several configurations in terms of unique reach as well as the selection frequencies for the levels of each attribute. Figure 7 Best Product Configurations for Top Three Attributes First product Second product Third product Fourth product Total Attribute 1 Attribute 2 Attribute 3 Unique % choosing Level C Level C Level C 19% Level C Level C Level B 13% Level C Level B Level B 11% Level C Level B Level C 9% 69% 66% 53% 52% Second Study: Consumer Durable The second study focused on a consumer durable category. For this study, we employed a partial profile design because of the large number of attributes (12). Six attributes were constant in every choice task (brand, price, form factor and three attributes that were either more important to management or had several levels). The remaining attributes randomly rotated into and out of the choice tasks. Each respondent evaluated 25 choice tasks, of which 5 were holdout tasks. Each task presented three concepts. 
A “none” alternative was included. Respondents, 1170 in total, were recruited from the HPOL on-line panel. A main effects model was estimated using CBC-HB. Study Two Results In comparing the results of the choice model with those of the build your own question, we assumed that two attributes, brand and form factor, would be most informative. Previous research has shown that brand and form factor preferences are particularly strong. Moreover, brand and form factor account for most of the variability in price in the build your own exercise. Figures 8 and 9 compare the simulated results for each brand and form factor with the selections from the build your own question. In this instance, if we take the product configuration as the reference point, the choice model overestimates preference for Form A and underestimates preference for Form B. Form B is somewhat (10%) more expensive that Form A, on average. For brand, the results are similar, with the choice model overestimating some brands and underestimating others. 106 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Figure 8 Predicted Brand Share vs. Actual Pick Rate for Form A Brand G 3.9% Brand F 3.8% 7.4% 6.5% 4.6% Brand E 8.7% Brand D Simulation 7.3% Brand C 4.4% Brand B Build Your Own 11.6% 8.1% 10.5% 6.1% 10.3% 10.1% Brand A 0% 5% 10% 15% Figure 9 Predicted Share for Form vs. Actual Pick Rate 15.7% Form C 12.6% Build Your Own Simulation 38.5% Form B 30.1% 45.8% Form A 57.8% 0% 10% 20% 30% 40% 50% 60% 70% We compared incremental effects of level changes from the choice model with level selections in the build your own question. Figure 10 shows the results for one attribute with three levels. This was a “partial” attribute, appearing in 50% of the choice tasks. This attribute has the largest price difference between levels (+$450 for level 2 and +$900 for level 3)4. So that 4 For this study, we incorporated the incremental pricing for attribute levels by adding the incremental price from the build your own question to the base price. The price range tested in the choice experiment was large enough to allow simulation of all base prices with all possible attribute specific incremental prices. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 107 the results are as comparable as possible, we looked at the differences by form factor for the two forms that accounted for 84% of the choices. These results are interesting because the differences between levels are much less pronounced for the choice model than for the build your own responses. In particular, for Form A, respondents are much more sensitive in configuring a product to the price difference between level 2 and level 3. Figure 10 Comparison of Incremental Effects of Attribute Levels Preference share/% of respondents 90 80 70 60 Sim --Form A--Attrib 1 50 BYO--Form A--Attrib 1 40 Sim --Form B--Attrib 1 BYO--Form B--Attrib 1 30 20 10 0 Level 1 Level 2 (+$450) Level 3 (+$900) We see a similar effect for the second attribute in Figure 11. We included a second simulation where the base price did not include the level-specific increment. There is very little difference between the two simulations, suggesting that the incremental prices may not be large enough to make a difference in the simulator. 
[Figure 11: Simulations with and without price increment, vs. BYO (preference share / % picking each of the eight attribute levels; series: Sim w/o price, Sim w/price, Build Your Own).]

DISCUSSION

At the outset we sought to answer three questions:

• How do the results from a design your own question compare to preference shares estimated from a discrete choice model?
• Can design your own questions be used to validate results from choice-based conjoint?
• Can design your own questions be used in place of conjoint experiments?

How Do the Results Compare?

With respect to the first question, we saw that the results are similar, but there appear to be some important differences. In particular, market simulations based on choice-based conjoint may be less sensitive than the build your own question to the incremental cost of different feature levels. The two tasks are similar in some respects, notably that respondents indicate their desire for particular features, but they differ in two very important aspects. First, in the choice task, the levels of the features are uncorrelated with the levels of price. In the build your own question, every feature level has some associated incremental price, even if that incremental price is $0 for a given level. Moreover, there is a relationship between the expected utility of a level and the price increment; as functionality or benefit increases, so does the incremental price. The second difference lies in the nature of the trade-offs that respondents make in each task. In a choice task, respondents must make trade-offs between the different attributes as well as between attributes and price. In a typical choice experiment, most of the concepts will be constructed so that a respondent will have to accept a less desirable level of one attribute in order to obtain a desired level of some other attribute. Sometimes the most desired level of an attribute will be offered at a relatively lower price, and sometimes at a relatively higher price. In the build your own question, respondents do not have to make compromises in selecting feature levels as long as they are willing to pay for those levels.

We suspect that these differences have two consequences. First, the fact that features and price were uncorrelated in the choice tasks makes it harder to detect feature-specific sensitivity to price. Our main effects models appear to do a poor job emulating the price-responsiveness observed in the design your own question. The inclusion of conditional pricing in the first study seemed to have little impact. The second consequence is that respondents may use different choice heuristics in the two tasks. The build your own question may constrain respondents to employ a compensatory strategy with respect to price. Respondents are forced to consider all attributes to complete the question, and each attribute has some cost implication. In the choice task, however, respondents are not so constrained, and may employ non-compensatory strategies such as elimination by aspects (EBA). Use of non-compensatory strategies may have a significant impact on parameter estimates, particularly in main effects-only models (see Johnson et al., 1989).

Can We Validate CBC with Build Your Own?
Our experience indicates that the build your own question provides indirect validation of choice-based conjoint results. Given the unlimited set of alternatives offered by the build your own question, it is difficult to devise market simulations that compare directly to the build your own results. Individual-level validation turns out to be quite difficult. As we have noted above, in one specific area the build your own results would appear to invalidate the choice-based conjoint results. In the build your own exercise respondents are more sensitive to incremental feature prices. At this point we cannot be sure if this lack of feature price sensitivity is a general characteristic of choice-based conjoint or specific to the models in our studies. Can Design Your Own Replace CBC? The strength of choice-based conjoint lies in the ability to estimate utility values for features and price in such a way that buyer response to different combinations of features and price can be simulated, usually in a competitive context, even if those specific combinations were never presented to survey respondents. As we have implemented it to date, the build your own question does not have this capability. However, the build your own framework may offer a way to elicit responses that can be used to estimate utilities, at least at an aggregate level. If base and incremental prices are systematically varied in some fashion, it might be possible to model the selection of different features as a function of price. Prices might be varied across individuals (requiring large samples) or perhaps within individuals through replicated build your own questions. CONCLUSIONS Based on our experience with these two studies, we believe that build your own questions may indeed add value to choice-based conjoint under specific circumstances. For those categories where buyers can choose among product or service variations defined by different levels of attributes, the build your own question may yield information that can be used to design the best combinations of features for variations within a product line. Build your own questions may also be more sensitive to feature-related price differences. REFERENCE Johnson, Eric J., Robert J. Meyer, and Sanjoy Ghose (1989), “When Choice Models Fail: Compensatory Models in Negatively Correlated Environments,” Journal of Marketing Research, 24 (August), 255-270 110 2001 Sawtooth Software Conference Proceedings: Sequim, WA. APPLIED PRICING RESEARCH Jay L. Weiner, Ph.D. Ipsos North America ABSTRACT The lifeblood of many organizations is to introduce new products to the marketplace. One of the commonly researched topics in the area of new product development is “What price should we charge?” This paper will discuss various methods of conducting pricing research. Focus will be given to concept testing systems for both new products and line extensions. Advantages and disadvantages of each methodology are discussed. Methods include monadic concept testing, the van Westendorp Premium Pricing Model (PSM) and choice modeling (conjoint and discrete choice). 
APPROACHES TO PRICING RESEARCH

• Blunt approach – you can simply ask – "how much would you be willing to pay for this product/service?"
• Monadic concept – present the new product/service idea and ask "how likely would you be to buy @ $2.99?"
• Price Sensitivity Meter (PSM) – van Westendorp
• Conjoint Analysis and Choice based conjoint
• Discrete Choice (and Brand/price tradeoff)

PRICE ELASTICITY

The typical economics course presents the downward sloping demand curve. As the price is raised, the quantity demanded drops, and total revenue often falls. In fact, most products exhibit a range of inelasticity. That is, demand may fall, but total revenue increases. It is the range of inelasticity that is of interest in determining the optimal price to charge for the product or service. Consider the following question: if the price of gasoline were raised 5¢ per gallon, would you drive any fewer miles? If the answer is no, then we might raise the price of gas such that the quantity demanded is unchanged, but total revenue increases. The range of inelasticity begins at the point where the maximum number of people are willing to try the product/service and ends when total revenue begins to fall.

Where the marketer chooses to price the product/service depends upon the pricing strategy. There are two basic pricing strategies. Price skimming sets the initial price high to maximize revenue. As the product moves through the product lifecycle, the price typically drops. This strategy is often used for technology-based products or products protected by patents. Intel, for example, prices each new processor high and, as competitors match performance characteristics, Intel lowers its price. Penetration pricing sets the initial price low to maximize trial. This pricing strategy tends to discourage competition, as economies of scale are often needed to make a profit.

The goal of the pricing researcher should be to understand this range of prices so as to make good strategic pricing decisions. In the blunt approach, you typically need a large number of respondents to understand purchase intent at a variety of price points. The monadic concept test also requires a fairly large number of respondents across a wide range of price points.

THE VAN WESTENDORP MODEL

In order to better understand the price consumers are willing to pay for a particular product or service, Dutch economist Peter van Westendorp developed the Price Sensitivity Meter. The underlying premise of this model is that there is a relationship between price and quality and that consumers are willing to pay more for a higher quality product. The PSM requires the addition of four questions to the questionnaire:

• At what price would you consider the product to be getting expensive, but you would still consider buying it? (EXPENSIVE)
• At what price would you consider the product too expensive and you would not consider buying it? (TOO EXPENSIVE)
• At what price would you consider the product to be getting inexpensive, and you would consider it to be a bargain? (BARGAIN)
• At what price would you consider the product to be so inexpensive that you would doubt its quality and would not consider buying it? (TOO CHEAP)

[Figure: Sample data, van Westendorp PSM (n=281): cumulative Too Expensive, Expensive, Bargain, and Too Cheap curves by price ($US); the range of acceptable prices is bounded by MGP=$3.38 and MDP=$5.00, with OPS=$3.58 and IDP=$4.95 marked.]
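Mechanically, each of the four questions yields one price per respondent, the plotted curves are cumulative percentages across a price grid, and the labeled points are where pairs of curves cross (their interpretation is discussed below). A minimal sketch with simulated answers rather than the study's data, using a coarse grid-based crossing rule:

```python
# Minimal PSM sketch (hypothetical data): build the four cumulative curves on a
# price grid and locate the crossing points discussed in the text.
import numpy as np

rng = np.random.default_rng(1)
n = 281
too_cheap = rng.uniform(1.0, 4.0, n)                 # answers to the TOO CHEAP question
bargain   = too_cheap + rng.uniform(0.5, 2.0, n)     # BARGAIN
expensive = bargain   + rng.uniform(0.5, 2.0, n)     # EXPENSIVE
too_exp   = expensive + rng.uniform(0.5, 2.0, n)     # TOO EXPENSIVE

grid = np.linspace(0, 14, 281)
# "cheap-side" curves fall with price; "expensive-side" curves rise with price
pct_too_cheap = [(too_cheap >= p).mean() for p in grid]
pct_bargain   = [(bargain   >= p).mean() for p in grid]
pct_expensive = [(expensive <= p).mean() for p in grid]
pct_too_exp   = [(too_exp   <= p).mean() for p in grid]

def crossing(falling_curve, rising_curve):
    """First grid price where the falling curve drops below the rising curve (coarse)."""
    diff = np.asarray(falling_curve) - np.asarray(rising_curve)
    return grid[np.argmax(diff <= 0)]

print("MGP (expensive x too cheap):     ", crossing(pct_too_cheap, pct_expensive))
print("OPS (too cheap x too expensive): ", crossing(pct_too_cheap, pct_too_exp))
print("IDP (bargain x expensive):       ", crossing(pct_bargain, pct_expensive))
print("MDP (bargain x too expensive):   ", crossing(pct_bargain, pct_too_exp))
```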
The indifference price point (IDP) occurs at the intersection of the bargain and expensive lines. To the right of this point, the proportion of respondents who think this product is expensive exceeds the proportion that thinks it is a bargain. If you choose to price the product less than IDP, you are losing potential profits. To the left of the indifference price, the proportion of respondents who think this price is a bargain exceeds the proportion who think it is expensive. Pricing the product in excess of the IDP causes the sales volume to decline. The IDP can be considered the “normal” price for this product. The marginal point of cheapness (MGP) occurs at the intersection of the expensive and too cheap curves. This point represents the lower bound of the range of acceptable prices. The marginal point of expensiveness (MDP) occurs at the intersection of the bargain and too expensive curves. The point represents the upper bound of the range of acceptable prices. The optimum price point (OPS) represents the point at which an equal number of respondents see the product as too expensive and too cheap. This represents the “ideal” price for this product. The range between $3.58 (OPS) and $4.95 (IDP) is typically the range of inelasticity. Newton, Miller and Smith (1993), offer an extension of the van Westendorp model. With the addition of two purchase probability questions (BARGAIN and EXPENSIVE price points), it is possible to plot trial and revenue curves. We can assume that the probability of purchase at the TOO CHEAP and TOO EXPENSIVE prices is 0. These curves will indicate the price that will stimulate maximum trial and the price that should produce maximum revenue for the company. It is the use of these additional questions that allows the researcher to frame the range of inelasticity. With the addition of purchase probability questions, it is possible to integrate the price perceptions into the model. The Trial/Revenue curves offer additional insight into the pricing question. By plotting the probability of purchase at each price point, we can identify the price that will stimulate maximum trial. By multiplying the proportion of people who would purchase the product at each price by the price of the product, we generate the revenue curve. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 113 Sample Data Trial/Revenue Curves (n=281) Max Revenue=$5.00 Trial Revenue Max Trial=$3.50 $0 $2 $4 $6 $8 $10 $12 $14 Price ($US) The difference between the point of maximum trial ($3.50) and the point of maximum revenue ($5.00) represents the relative inelasticity of the product. Inelastic products are products where there is little or no decrease in sales if the price were increased. Most products are inelastic for a narrow range of price. If we choose to price the product at the point of maximum revenue, we may lose a few customers, but the incremental revenue more than offsets the decline in sales. Monadic price concept tests are very blunt instruments for determining price sensitivity. We believe that the data suggest that the range of inelasticity is far greater than it should be. For example: For one concept tested three prices, i.e., $9.99, $12.99, and $15.99 which means we tested the same concept three times each with a different price. The results showed that the $9.99 elicited significantly more appeal than either of the other two other price points which were equal in appeal (DWB: 18%…for $9.99 vs. 14%…for $12.99 & $15.99). 
As such, the conclusion would be that trial would be highest at $9.99 but that the trial curve would be inelastic from $12.99 to $15.99. Therefore, one interpretation would be that if we were willing to accept lower trial by pricing at $12.99, we could increase our revenue by jumping to $15.99, as the consumer is price insensitive between $12.99 and $15.99. However, the van Westendorp analysis adds a different perspective. Using this model, the optimum price is about the same. Maximum trial is achieved at $8 while maximum revenue is achieved between $8.60 and $9.95. Beyond these price points, pricing is very elastic, meaning that both trial and revenue would decrease significantly as price increases.

Comparison of van Westendorp to Monadic Concept Tests

In this experiment, we tested eight new product concepts. There were four cells for each of the eight concepts. A nationally representative sample was drawn for each group (Base > 300 each cell). For each new product concept, there were three monadic price cells (the price the client wanted to test and ±25%) and one unpriced cell with the van Westendorp and Newton/Miller/Smith questions. Both premium price and low price products were tested to determine if the relative price affected the results. Premium price products are products that are relatively expensive compared to other products in the category.

Products/Prices Tested

Category                           Low      Med      High
Healthcare (Premium Price)         $4.29    $5.69    $6.99
Healthcare (Low Price)             $2.79    $3.69    $4.59
Health & Beauty (Premium Price)    $5.59    $7.49    $9.39
Health & Beauty (Low Price)        $2.99    $3.99    $4.99
Snack (Premium Price)              $2.99    $3.99    $4.99
Condiment (Low Price)              $2.29    $2.99    $3.69
Home Care (Low Price)              $1.89    $2.49    $3.09
Personal Care (Premium Price)      $2.29    $2.99    $3.69

[Figures: van Westendorp trial vs. monadic concept trial curves by price for each product: Healthcare (Premium Price), Healthcare (Low Price), Health & Beauty (Premium Price), Health & Beauty (Low Price), Snack Food (Premium Price), Condiment (Low Price), Home Care (Low Price), Personal Care (Premium Price).]

HOW WELL DOES THIS VALIDATE?

For each of the 8 products, we have actual year 1 trial (IRI), estimates of ACV and awareness, and average product price (net of promotions). For two of our product concepts, the average price was below the lowest monadic price tested.
Category                           Low      Avg. Price
Healthcare (Premium Price)         $4.29    $6.14
Healthcare (Low Price)             $2.79    $3.13
Health & Beauty (Premium Price)    $5.59    $5.16 ***
Health & Beauty (Low Price)        $2.99    $3.47
Snack (Premium Price)              $2.99    $3.81
Condiment (Low Price)              $2.29    $2.50
Home care (Low Price)              $1.89    $1.79 ***
Personal Care (Premium Price)      $2.29    $2.44

*** Average price (net of promotions) was below the lowest monadic price tested.

[Figure: Predicted versus Actual Trial: actual year 1 trial vs. van Westendorp-predicted and monadic-predicted trial for the eight products.]

WHAT ABOUT CONJOINT ANALYSIS

For the lower priced Healthcare product, conjoint purchase intent data were collected. In addition to price, two additional attributes were varied. This resulted in a very simple nine-card design. By selecting the correct levels of the two additional attributes tested, we can predict purchase intent for each of the three prices tested. The prices from the monadic concept test were used in the conjoint. The results suggest that conjoint over-predicts purchase intent.

[Figure: Healthcare Product (Low Price): van Westendorp trial, monadic concept trial, and conjoint-predicted trial by price.]

SUMMARY

Monadic concept tests tend to over-estimate trial. This may be due to the fact that prices given to respondents in a monadic concept test do not adequately reflect sales promotion activities. Respondents may think that the price given in the concept is the suggested retail price and that they are likely to buy on deal or with a coupon. Monadic concept tests require a higher base size. A typical concept test would require 200 to 300 completes per cell. The number of cells required would depend on the prices tested, but from the results, it appears that three cells tend to over-estimate the range of inelasticity. Providing a competitive price frame might improve the results of monadic concept tests.

The van Westendorp series does a reasonable job of predicting trial from a concept test without the need for multiple cells. This reduces the cost of pricing research and also the likelihood that we fail to test a price low enough. The prices given by respondents are believed to represent the actual out-of-pocket expenses. This permits the researcher some understanding of the effects of promotional activities (on-shelf price discounts or coupons). The van Westendorp series will also permit the researcher to understand the potential trial at price points higher than those that might be tested in a monadic test. Conjoint analysis tends to over-estimate trial, but competitive testing (discrete choice, CBC) should improve the results.

REFERENCES

Lyon, David W. (2000), "Pricing Research," Chapter 21 in Marketing Research: State-of-the-Art Perspectives.

Newton, Dennis, Jeff Miller and Paul Smith (1993), "A Market Acceptance Extension to Traditional Price Sensitivity Measurement," AMA ART Forum.

van Westendorp, Peter H. (1976), "NSS – Price Sensitivity Meter (PSM) – A New Approach to Study Consumer Perception of Price," Venice Congress Main Sessions, p. 139-167.

Weiner, Jay L. and Lee Markowitz (2000), "Calibrating/Validating an Adaptation of the van Westendorp Model," AMA ART Forum Poster Session.

RELIABILITY AND COMPARABILITY OF CHOICE-BASED MEASURES: ONLINE AND PAPER-AND-PENCIL METHODS OF ADMINISTRATION

Thomas W. Miller A.C.
Thomas W. Miller, A.C. Nielsen Center, School of Business, University of Wisconsin-Madison
David Rake*, Reliant Energy
Takashi Sumimoto*, Harris Interactive
Peggy S. Hollman*, General Mills

* Research completed while students at the A.C. Nielsen Center, School of Business, University of Wisconsin-Madison.

ABSTRACT

Are choice-based measures reliable? Are measures obtained from the online administration of choice tasks comparable to measures obtained from paper-and-pencil administration? Does the complexity of a choice task affect reliability and comparability? We answered these questions in a test-retest study. University student participants made choices for 24 pairs of jobs in test and retest phases. Students in the low task complexity condition chose between pairs of jobs described by six attributes; students in the high task complexity condition chose between pairs of jobs described by ten attributes. To assess reliability or comparability, we used the number of choices in agreement between test and retest. We observed high reliability and comparability across methods of administration and levels of task complexity.

INTRODUCTION

Providers and users of marketing research have fundamental questions about the reliability and validity of online measures (Best et al. 2001; Couper 2000; Miller and Gupta 2001). Studies in comparability help to answer these questions. In this paper, we present a systematic study of online and paper-and-pencil methods of administration. We examine the reliability and comparability of choice-based measures. Our study may be placed within the context of studies that compare online and traditional methods of administration (Miller 2001; Miller and Dickson 2001; Miller and Panjikaran 2001; Witt 2000).

METHODS

We implemented the online choice task using zTelligence™ survey software, with software and support from InsightTools and MarketTools. To provide a relevant task for university student participants, we used a simple job choice task. Students were given Web addresses and personal identification numbers. The online task consisted of a welcome screen of instructions and 24 choice pairs, with one choice pair per screen. Student participants clicked on button images under job descriptions to indicate choices. Students were permitted to work from any Web-connected device. Computer labs and help desk assistance were available. Initial online instructions to students were as follows:

In this survey you will be given 24 pairs of jobs. For each pair of jobs, your task is to pick the job that you prefer. Indicate your choice by clicking on the button below the preferred job. After you have made your choice, press the "CONTINUE" button to move on to the next pair of jobs.

For each choice pair, we repeated the instruction, "Pick the job that you prefer."

Paper-and-pencil administration consisted of one page of instructions describing the choice task, a 24-page booklet with one choice pair per page, and a separate answer sheet. Instructions for the paper-and-pencil method were similar to instructions for the online method, except that students were asked to indicate job preferences by using a pencil to fill in circles on the answer sheet. Answer sheets were machine-scored.

To examine choice tasks of different levels of complexity, we constructed six- and ten-attribute choice designs. The six-attribute design consisted of the first six attributes from the ten-attribute design.
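As a minimal sketch of what such a paired task looks like in data terms, the snippet below draws job pairs in which the two jobs never share a level on any attribute, using the six attributes of the low-complexity condition from Table 1. It is an illustration only; it does not reproduce the SAS PLAN and OPTEX design described next, and it enforces none of that design's balance properties.

```python
import random

# Minimal sketch, not the authors' SAS PLAN/OPTEX design: draw side-by-side job
# pairs in which the right-hand job differs from the left-hand job on every
# attribute. Attribute levels follow Table 1; level balance is not enforced.

ATTRIBUTES = {
    "Annual salary": ["$35,000", "$40,000", "$45,000", "$50,000"],
    "Typical work week": ["30 hours", "40 hours", "50 hours", "60 hours"],
    "Schedule": ["Fixed work hours", "Flexible work hours"],
    "Annual vacation": ["2 weeks", "4 weeks"],
    "Location": ["Small city", "Large city"],
    "Climate": ["Mild, small seasonal changes", "Large seasonal changes"],
}

def random_choice_pair(attributes):
    """Return (left_job, right_job) with no attribute level in common."""
    left, right = {}, {}
    for name, levels in attributes.items():
        left[name] = random.choice(levels)
        right[name] = random.choice([lvl for lvl in levels if lvl != left[name]])
    return left, right

random.seed(1)
pairs = [random_choice_pair(ATTRIBUTES) for _ in range(24)]
print(pairs[0])
```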
We employed identical 24-pair choice tasks in test and retest phases. Some participants received the six-attribute task; others received the ten-attribute task. Table 1 shows the job attributes and their levels.

To design the choice tasks, we worked with the set of ten job attributes and their levels. Side-by-side choice set pairs were generated so that all attribute levels for the left-hand member of a pair would be different from attribute levels of the right-hand member of the pair. We assured within-choice-pair balance by having levels within attributes matched equally often with one another. Using the PLAN and OPTEX procedures of the Statistical Analysis System (SAS), we generated a partially balanced factorial design. The design was balanced for main effects, with each level of each attribute occurring equally often on the left- and right-hand sides of choice pairs. The design was balanced for binary attribute two-way interaction effects, with both levels of a binary attribute occurring equally often with both levels of all other binary attributes. Two-way interactions between the income attribute and all binary attributes were also balanced in the design. Because the six-attribute design was a subset of the ten-attribute design, balance for the ten-attribute design applied as well to the six-attribute choice design.

We employed a follow-up survey, which was administered online to all participants after the retest. Self-reported day and time of test and retest, demographic data, and task preferences were noted in the follow-up survey. Students were also encouraged to make open-ended comments about the task.

Attribute              Levels
Annual salary          $35,000; $40,000; $45,000; $50,000
Typical work week      30 hours; 40 hours; 50 hours; 60 hours
Schedule               Fixed work hours; Flexible work hours
Annual vacation        2 weeks; 4 weeks
Location               Small city; Large city
Climate                Mild, small seasonal changes; Large seasonal changes
Type of organization   Work in small organization; Work in large organization
Type of industry       High-risk, growth industry; Low-risk, stable industry
Work and life          On call while away from work; Not on call while away from work
Signing bonus          No signing bonus; $5,000 signing bonus

Table 1: Job Attributes and Levels

Methods of administration in the test and retest phases defined a between-subjects factor in our study. Data from participants receiving the same method of administration in test and retest (online test and retest or paper-and-pencil test and retest) provided information about the reliability of measures. Data from participants receiving both methods of administration (online test followed by paper-and-pencil retest or paper-and-pencil test followed by online retest) provided information about the comparability of methods. Methods of administration were completely crossed with the other between-subjects factor, task complexity (low complexity at six attributes or high complexity at ten attributes), to yield eight treatment conditions.

We recruited student volunteers from an undergraduate course in marketing management at the University of Wisconsin-Madison. We assembled packets with instructions and treatment materials for the test phase. Packets for the eight treatment conditions were randomly ordered and distributed to students at the conclusion of their Monday classes.
On Wednesday of the same week, students went to a central location, returned their test-phase packets and received packets for the retest phase and follow-up survey. The incentive for students, five points toward their grade in the class, was provided only to those who completed the test, retest, and follow-up surveys.

Table 2 shows response rates across the eight treatment conditions. In three out of four test-retest conditions, we observed higher response rates for participants who received the six-attribute task than for participants who received the ten-attribute task. This is not surprising given that the six-attribute task can be expected to require less effort to complete.

Methods of Administration   Task Complexity   Number Distributed   Number Completed   Response Rate (Percent Complete)
Online-Online               Six Attributes    48                   30                 62.5
Online-Paper                Six Attributes    56                   43                 76.8
Paper-Online                Six Attributes    51                   44                 86.3
Paper-Paper                 Six Attributes    57                   49                 86.0
Online-Online               Ten Attributes    55                   37                 67.3
Online-Paper                Ten Attributes    50                   36                 72.0
Paper-Online                Ten Attributes    50                   36                 72.0
Paper-Paper                 Ten Attributes    48                   31                 64.6

Table 2: Response Rates by Methods of Administration and Task Complexity (Percent Completing Test, Retest, and Follow-up Questionnaire)

Our analysis of reliability and comparability focused upon the number of choices in agreement between test and retest. The number of choices in agreement represented a simple response variable to compute and understand. We would expect other choice-based measures, such as estimates of job choice share, attribute importance, or utility, to follow a similar pattern of results. That is, if we see a high number of choices in agreement between test and retest, then we expect derivative measures to have high agreement between test and retest.

RESULTS AND CONCLUSIONS

We observed high levels of test-retest reliability for both the online and paper-and-pencil methods. We also observed high levels of comparability between online and paper-and-pencil methods. Median numbers of choices in agreement were between 20 and 21 for a task involving 24 choice pairs. High levels of reliability and comparability were observed for both the six- and ten-attribute tasks. Within a classical hypothesis testing framework using generalized linear models and our completely crossed two-factor design (methods of administration by task complexity), we observed no statistically significant differences across treatments for the number of choices in agreement.

Table 3 shows summary statistics for the number of choices in agreement between test and retest across the eight treatment conditions. Figure 1, a histogram trellis, shows the study data for the number of choices in agreement between test and retest. Another view of the results is provided by the box plot trellis of Figure 2, which shows the proportion of choices in agreement between test and retest.

Methods of Administration   Task Complexity   Minimum   Median   Mean
Online-Online               Six Attributes    16        21       21.1
Online-Paper                Six Attributes    10        21       20.6
Paper-Online                Six Attributes    11        21       20.4
Paper-Paper                 Six Attributes    11        21       20.5
Online-Online               Ten Attributes     8        20       19.5
Online-Paper                Ten Attributes    15        20.5     20.4
Paper-Online                Ten Attributes    11        20       19.6
Paper-Paper                 Ten Attributes    14        21       20.3

Table 3. Number of Choices in Agreement by Methods of Administration and Task Complexity
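The number of choices in agreement is straightforward to compute; the sketch below shows the calculation for one respondent, with illustrative (not study) data.

```python
# Minimal sketch of the agreement measure: count, for each respondent, how many
# of the 24 choice pairs received the same pick in test and retest. The data
# below are illustrative placeholders, not study data.

def choices_in_agreement(test_choices, retest_choices):
    """Number of pairs on which the same alternative was chosen both times."""
    assert len(test_choices) == len(retest_choices)
    return sum(t == r for t, r in zip(test_choices, retest_choices))

# One hypothetical respondent, 24 pairs coded 0 = left job, 1 = right job.
test   = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
retest = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]

print(choices_in_agreement(test, retest), "of", len(test), "choices in agreement")
```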
[Figure 1: Histogram trellis of the number of choices in agreement between test and retest, by method of administration and task complexity]

[Figure 2: Box plot trellis of the proportion of choices in agreement between test and retest, by method of administration and task complexity]

We conclude that online and paper-and-pencil tasks are highly reliable and highly comparable under favorable conditions. What are "favorable conditions"? In the present study, the task was simple, involving choice pairs with at most ten attributes. The task was short, involving 24 identical choice pairs in test and retest phases. The task was relevant (job choices for students) and the participants were motivated (by extra credit points). The participants were Web-savvy and had access to computer lab assistance. Finally, the time between test and retest was short, fewer than three days for most participants.

The reader might ask, "If you could not demonstrate high levels of reliability and comparability under such favorable conditions, where could you demonstrate it?" This is a reasonable question. Our research is just beginning, an initial study in comparability. Additional research is needed to explore reliability and comparability across more difficult choice tasks, both in terms of the number of alternative profiles in each choice set and in terms of the number of attributes defining profiles. Further research should also examine the performance of less Web-savvy, non-university participants.

ACKNOWLEDGMENTS

Many individuals and organizations contributed to this research. Bryan Orme of Sawtooth Software provided initial consultation regarding previous conjoint and choice study research. InsightTools and MarketTools provided access to online software tools and services. Holly Pilch and Krista Sorenson helped with the recruitment of subjects and with the administration of the choice-based experiment. Caitlyn A. Beaudry, Janet Christopher, and Nicole Kowbel helped to prepare the manuscript for publication.

REFERENCES

Best, S. J., Krueger, B., Hubbard, C., and Smith, A. (2001). "An Assessment of the Generalizability of Internet Surveys," Social Science Computer Review, 19(2), Summer, 131-145.

Couper, M. P. (2000). "Web Surveys: A Review of Issues and Approaches," Public Opinion Quarterly, 64(4), Winter, 464-494.

Miller, T. W. (2001). "Can We Trust the Data of Online Research?" Marketing Research, Summer, 26-32.

Miller, T. W. and Dickson, P. R. (2001). "On-line Market Research," International Journal of Electronic Commerce, 3, 139-167.

Miller, T. W. and Gupta, A. (2001). Studies of Information, Research, and Consulting Services (SIRCS): Fall 2000 Survey of Organizations, A.C. Nielsen Center for Marketing Research, Madison, WI.

Miller, T. W. and Panjikaran, K. (2001). Studies in Comparability: The Propensity Scoring Approach, A.C. Nielsen Center for Marketing Research, Madison, WI.

Witt, Karlan J. (2000). "Moving Studies to the Web," Sawtooth Software Conference Proceedings, 1-21.
TRADE-OFF STUDY SAMPLE SIZE: HOW LOW CAN WE GO?1

Dick McCullough
MACRO Consulting, Inc.

1 The author wishes to thank Rich Johnson for his invaluable suggestions and guidance during the preparation of this paper. The author also thanks POPULUS and The Analytic Helpline, Inc. for generously sharing commercial data sets used in the analysis.

ABSTRACT

The effect of sample size on model error is examined through several commercial data sets, using five trade-off techniques: ACA, ACA/HB, CVA, HB-Reg and CBC/HB. Using the total sample to generate surrogate holdout cards, numerous subsamples are drawn, utilities estimated and model results compared to the total sample model. Latent class analysis is used to model the effect of sample size, number of parameters and number of tasks on model error.

INTRODUCTION

The effect of sample size on study precision is always an issue for commercial market researchers. Sample size is generally the single largest out-of-pocket cost component of a commercial study. Determining the minimum acceptable sample size plays an important role in the design of an efficient commercial study.

For simple statistical measures, such as confidence intervals around proportions estimates, the effect of sample size on error is well known (see Figure 1). For more complex statistical processes, such as conjoint models, the effect of sample size on error is much more difficult to estimate. Even the definition of error is open to several interpretations.

[Figure 1: Sample Error for Proportions at 50% (confidence interval vs. sample size)]

Many issues face practitioners when determining sample size:

• Research objectives
• Technique
• Number of attributes and levels
• Number of tasks
• Expected heterogeneity
• Value of the information
• Cost and timing
• Measurement error
• Structure and efficiency of experimental design:
  o Fixed designs
  o Blocked designs
  o Individual-level designs

Some of these issues are statistical in nature, such as number of attributes and levels, and some of these issues are managerial in nature, such as value of the information, cost and timing. The commercial researcher needs to address both types of issues when determining sample size.

Objectives

The intent of this paper is to examine a variety of commercial data sets in an empirical way to see if some comments can be made about the effect of sample size on model error. Additionally, the impact on model error of several factors (number of attributes and levels, number of tasks, and trade-off technique) will also be investigated.

Method

For each of five trade-off techniques, ACA, ACA/HB, CVA, HB-Reg, and CBC/HB, three commercial data sets were examined (the data sets for ACA and CVA also served as the data sets for ACA/HB and HB-Reg, respectively). Sample size for each data set ranged between 431 and 2,400. Since these data sets were collected from a variety of commercial marketing research firms, there was little control over the number of attributes and levels or the number of tasks. Thus, while there was some variation in these attributes, there was less experimental control than would be desired, particularly with respect to trade-off technique.
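The proportions error curve of Figure 1 is straightforward to reproduce; a minimal sketch follows, assuming a 95% confidence level (the paper does not state which level Figure 1 uses).

```python
# Minimal sketch of the well-known error curve in Figure 1: the confidence
# interval half-width (in percentage points) for a proportion of 50%, as a
# function of sample size. The 95% level (z = 1.96) is an assumption.
import math

def ci_half_width(n, p=0.5, z=1.96):
    """Confidence interval half-width for a proportion, in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

for n in (50, 100, 200, 400, 600):
    print(f"n = {n:4d}  +/- {ci_half_width(n):4.1f} points")
```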
Table 1

Technique      Data Set     Attr   Lvls   Pars   Tasks   df     SS
CBC/HB         Data set 1    4      14     11      8     -3     612
CBC/HB         Data set 2    6      17     12     18     +6     422
CBC/HB         Data set 3    5      25     21     12     -9     444
CVA, HB-Reg    Data set 1    6      24     19     30    +11   2,400
CVA, HB-Reg    Data set 2    4       9      6     10     +4     431
CVA, HB-Reg    Data set 3    6      13      8     16     +8     867
ACA, ACA/HB    Data set 1   25      78     54      -      -     782
ACA, ACA/HB    Data set 2    5      24     20      -      -     500
ACA, ACA/HB    Data set 3   17      63     47      -      -     808

Notice in Table 1 above that the number of parameters and number of tasks are somewhat correlated with trade-off technique. CBC/HB data sets tended to have fewer degrees of freedom (number of tasks minus the number of parameters) than CVA data sets. ACA data sets had a much greater number of parameters than either CBC/HB or CVA data sets. These correlations occur quite naturally in the commercial sector. Historically, choice models have been estimated at the aggregate level while CVA models are estimated at the individual level. By aggregating across respondents, choice study designers could afford to use fewer tasks than necessary for estimating individual level conjoint models. Hierarchical Bayes methods allow for the estimation of individual level choice models without making any additional demands on the study's experimental design. A major benefit of ACA is its ability to accommodate a large number of parameters.

For each data set, models were estimated using a randomly drawn subset of the total sample, for the sample sizes of 200, 100, 50 and 30. In the cases of ACA and CVA, no new utility estimation was required, since each respondent's utilities are a function of just that respondent. However, for CBC/HB, HB-Reg and ACA/HB, new utility estimations occurred for each draw, since each respondent's utilities are a function of not only that respondent, but also the "total" sample.

For each sample size, random draws were replicated up to 30 times. The number of replicates increased as sample size decreased. There were five replicates for n=200, 10 for n=100, 20 for n=50 and 30 for n=30. The intent here was to stabilize the estimates to get a true sense of the accuracy of models at that sample size.

Since it was anticipated that many, if not all, of the commercial data sets to be analyzed in this paper would not contain holdout choice tasks, models derived from reduced samples were compared to models derived from the total sample. That is, in order to evaluate how well a smaller sample size was performing, 10 first choice simulations were run for both the total sample model and each of the reduced sample models, with the total sample model serving to generate surrogate holdout tasks. Thus, MAEs (Mean Absolute Error) were the measure with which models were evaluated (each sub-sample model being compared to the total sample model). 990 models (5 techniques x 3 data sets x 66 sample size/replicate combinations) were estimated and evaluated. 9,900 simulations were run (990 models x 10 simulations) as the basis for the MAE estimations.

Additionally, correlations were run, at the aggregate level, between the mean utilities from each of the sub-sample models and the total sample model. Correlation results were reported in the form 100 * (1 - r-squared), and called, for the duration of this paper, mean percentage of error (MPE). It should be noted that there is an indeterminacy inherent in conjoint utility scaling that makes correlation analysis potentially meaningless. Therefore, all utilities were scaled so that the levels within each attribute summed to zero. This allowed for meaningful correlation analysis to occur.
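A minimal sketch of the two error measures just described follows; the share and utility values are illustrative placeholders, not results from the commercial data sets.

```python
# Minimal sketch of the two error measures described above. Shares and utilities
# below are illustrative placeholders, not values from the commercial data sets.
import numpy as np

def mae(total_sample_shares, sub_sample_shares):
    """Mean absolute error between simulated shares (in share points)."""
    return np.mean(np.abs(np.asarray(total_sample_shares) - np.asarray(sub_sample_shares)))

def mpe(total_sample_utils, sub_sample_utils, attribute_slices):
    """Mean percentage of error: 100*(1 - r^2) between zero-centered mean utilities."""
    def zero_center(utils):
        utils = np.asarray(utils, dtype=float).copy()
        for sl in attribute_slices:             # one slice per attribute
            utils[sl] -= utils[sl].mean()       # levels within each attribute sum to zero
        return utils
    r = np.corrcoef(zero_center(total_sample_utils), zero_center(sub_sample_utils))[0, 1]
    return 100 * (1 - r ** 2)

# Example with two attributes of three levels each (slices mark each attribute):
slices = [slice(0, 3), slice(3, 6)]
total_utils = [0.9, 0.2, -0.8, 1.4, 0.1, -1.2]
sub_utils   = [1.0, 0.1, -0.9, 1.1, 0.3, -1.3]
print(round(mae([42.0, 33.0, 25.0], [40.5, 34.0, 25.5]), 2))   # MAE in share points
print(round(mpe(total_utils, sub_utils, slices), 2))            # MPE
```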
SAMPLE BIAS ANALYSIS

Since each sub-sample was being compared to a larger sample, of which it was also a part, there was a sample bias inherent in the calculation of error terms. Several studies using synthetic data were conducted to determine the magnitude of the sample bias and develop correction factors to adjust the raw error terms for sample bias.

Sample Bias Study 1

For each of four different scenarios, random numbers between 1 and 20 were generated 10 times for two data sets of sample size 200. In the first scenario, the first 100 data points were identical for the two data sets and the last 100 were independent of one another. In the second scenario, the first 75 data points were identical for the two data sets and the last 125 were independent of one another. In the third scenario, the first 50 data points were identical for the two data sets and the last 150 were independent of one another. And in the last scenario, the first 25 data points were identical for the two data sets and the last 175 were independent of one another. The correlation between the two data sets, r, approximately equals the degree of overlap, n/N, between the two data sets (Table 2).

Table 2 (N = 200)

                  n = 100      n = 75       n = 50       n = 25
r (replicates)    0.527451     0.320534     0.176183     0.092247
                  0.474558     0.411911     0.255339     0.142685
                  0.611040     0.310900     0.226798     0.111250
                  0.563223     0.287369     0.223945     0.194286
                  0.487692     0.398193     0.368615     0.205507
                  0.483789     0.473380     0.229888    -0.095050
                  0.524381     0.471472     0.288293     0.250967
                  0.368708     0.274371     0.252346     0.169203
                  0.446393     0.401521     0.245936     0.109158
                  0.453217     0.389331     0.139375     0.184337
Mean r            0.494045     0.373898     0.240672     0.136459
n/N               0.500000     0.375000     0.250000     0.125000

Sample Bias Study 2

To extend the concept further, a random sample of 200 was generated, a second sample of 100 was created where each member of the second sample was equal to a member of the first sample, and a third sample of a random 100 was generated, independent of the first two. For each of the three samples, the mean was calculated. This process was replicated 13 times and the mean data are reported below (Table 3). The absolute difference (MAE) between the first two data sets is 0.147308 and the absolute difference between the first and third data sets is 0.218077. By dividing the MAE for the first two data sets by the finite population correction factor (sqrt(1 - n/N)), the MAEs become quite similar.

Table 3 (N = 200)

N = 200      n = 100 (overlapping)   n = 100 (independent)
11.07500     11.18000                 9.54000
10.27500     10.15000                11.15000
10.85000     11.15000                10.62000
10.59500     10.51000                10.81000
 9.99000      9.92000                10.88000
 9.73500     10.11000                11.19000
10.55500     11.30000                11.43000
11.44000     11.68000                10.88000
10.41000     10.33000                 9.37000
10.13000     10.55000                10.87000
10.34000      9.84000                11.23000
10.29500     10.86000                11.46000
10.85500     10.88000                 9.95000
10.50346     10.65077                10.72154   (means)

MAE = 0.147308 (overlapping) and 0.218077 (independent)
MAE/sqrt(1 - n/N) = 0.208325

Sample Bias Study 3

To continue the extension of the concept, a random sample of 200 was generated, a second sample of 100 was created where each member of the second sample was equal to a member of the first sample, and a third sample of a random 100 was generated. The squared correlation was calculated for the first two samples and for the first and third samples. This procedure was replicated 11 times. The 11 squared correlations for the first two samples were averaged, as were the 11 squared correlations for the first and third samples. MPEs were calculated for both mean r-squares (Table 4).
The MPE for the first two samples is substantially smaller than the MPE for the first and third samples. By dividing the MPE for the first two samples by the square of the finite population correction factor (1 - n/N), the MPEs become quite similar. Note that it is somewhat intuitive that the correction factor for the MPEs is the square of the correction factor for the MAEs. MPE is a measure of squared error whereas MAE is a measure of first power error.

Table 4

                   ns=100       n(R)=100
rsq =              0.603135     0.099661
rsq =              0.648241     0.048967
rsq =              0.357504     0.111730
rsq =              0.303370     0.099186
rsq =              0.790855     0.178414
rsq =              0.883459     0.379786
rsq =              0.829014     0.182635
rsq =              0.477881     0.270630
rsq =              0.798317     0.010961
rsq =              0.425018     0.462108
rsq =              0.785462     0.003547
Average rsq =      0.627478     0.167966
MPE =             37.252220    83.203400
MPE/(1 - n/N) =   74.504450

Sample Bias Study 4

Finally, the synthetic data study below involves more closely replicating the study design used in this paper.

Method

The general approach was:

• Generate three data sets
  o Each data set consists of utility weights for three attributes
  o Utility weights for the first and third data sets are randomly drawn integers between 1 and 20
  o Sample size for the first data set is always 200
  o Sample size for the second and third data sets varies across 25, 50 and 100
  o The second and third data sets are always of the same size
  o The second data set consists of the first n cases of the first data set, where n = 25, 50 or 100
• Define either a two, three, four or five product scenario
• Estimate logit-based share of preference models for each of the three data sets, calculating shares at the individual level, then averaging
• Calculate MAEs for each of the second and third data sets, compared to the first, at the aggregate level
• Calculate MPEs (mean percent error = 100 * (1 - r-squared between the utilities of the first data set and the utilities of the other data set)) for each of the second and third data sets, compared to the first, at the aggregate level
• Redraw the sample 50 times for each scenario/sample size and make the above calculations
• Calculate mean MAEs and MPEs across the 50 random draws for each model
• 36 models (3 data sets x 4 market scenarios x 3 sample sizes)

Note: Empirically, the ratio of random sample MAE to overlapping sample MAE equals the scalar that corrects the overlapping sample MAE for sample bias. Similarly for MPE. The issue, then, is to develop a formula for the correction factor that closely resembles the ratio of random sample error to overlapping sample error.

CONCLUSION

As suggested by Synthetic Data Study 2, the formula (1/(1 - percent overlap))^0.5 may represent the desired scalar for correction for MAE. Similarly, as suggested by Synthetic Data Study 3, the formula 1/(1 - percent overlap) may represent the desired scalar for correction for MPE:

Table 5

MAE
Percent Overlap      (1/(1-%overlap))^0.5    random/overlap
12.5% (n=25)         1.07                    1.17
25%   (n=50)         1.15                    1.32
50%   (n=100)        1.41                    1.56

MPE
Percent Overlap      1/(1-%overlap)          random/overlap
12.5% (n=25)         1.14                    1.18
25%   (n=50)         1.33                    1.84
50%   (n=100)        2.00                    2.95

[Figure 2: Theoretical vs. Empirical Adjustment Factors for MAE (1/(1-%)^0.5 and E(Rn)/E(n) by sample size)]

[Figure 3: Theoretical vs. Empirical Adjustment Factors for MPE (1/(1-%) and E(Rn)/E(n) by sample size)]
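The comparison in Table 5 can also be checked by direct simulation in the spirit of Sample Bias Studies 2 and 4. The sketch below uses simple sample means rather than share-of-preference models, so its numbers will not exactly match Table 5; it is an illustration, not the author's simulation code.

```python
# Minimal sketch: for a parent sample of N = 200 and an overlapping subsample of
# size n, compare the error of the overlapping subsample and of an independent
# sample of the same size against the parent mean, and report the empirical
# random/overlap ratio next to the proposed factor 1/sqrt(1 - n/N).
import random

random.seed(0)
N, replicates = 200, 5000

for n in (25, 50, 100):
    overlap_err = random_err = 0.0
    for _ in range(replicates):
        parent = [random.randint(1, 20) for _ in range(N)]
        independent = [random.randint(1, 20) for _ in range(n)]
        parent_mean = sum(parent) / N
        overlap_err += abs(parent_mean - sum(parent[:n]) / n)   # first n cases overlap
        random_err += abs(parent_mean - sum(independent) / n)
    empirical = random_err / overlap_err
    theoretical = 1 / (1 - n / N) ** 0.5
    print(f"n={n:3d}  empirical ratio {empirical:.2f}  theoretical 1/sqrt(1-n/N) {theoretical:.2f}")
```

As in Table 5 and Figure 2, the empirical ratio tends to sit somewhat above the proposed theoretical factor.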
Additional conclusions:

• There is a definite bias due to overlapping sample, both in MAE and MPE.
• This bias appears to be independent of the number of products in the simulations (see Tables 6 and 7).
• The bias is directly related to the percent of the first data set duplicated in the second.
• The amount of bias is different for MAE and MPE.

Table 6: MAE (n = overlapping subsample, R = independent random sample of the same size)

          Two products   Three products   Four products   Five products   Mean
n=25      0.077167       0.065819         0.043731        0.035232        0.055487
n=R25     0.081864       0.078922         0.050916        0.044561        0.064066
n=50      0.041640       0.046604         0.029209        0.024017        0.035367
n=R50     0.057031       0.057866         0.039263        0.031406        0.046391
n=100     0.024216       0.024658         0.018479        0.013832        0.020297
n=R100    0.033042       0.040346         0.029548        0.022815        0.031438

Table 7: MPE

          Two products   Three products   Four products   Five products   Mean
n=25      0.707187       0.687752         0.856957        0.664759        0.729164
n=R25     0.785404       0.870813         0.869094        0.884406        0.852429
n=50      0.242857       0.312909         0.292543        0.246180        0.273622
n=R50     0.437064       0.554906         0.453530        0.552845        0.499586
n=100     0.094199       0.096767         0.123103        0.099624        0.103423
n=R100    0.281836       0.335639         0.490892        0.296887        0.351314

SAMPLE SIZE STUDY RESULTS

Referring to the error curve for proportions once again (Figure 1), a natural point to search for in the error curve would be an elbow. An elbow would be a point on the curve where any increase in sample size would result in a declining gain in precision and any decrease in sample size would result in an increasing loss in precision. This elbow, if it exists, would identify the maximally efficient sample size. Visually, and intuitively, an elbow would appear as noted in Figure 4.

[Figure 4: Sample Curve With Elbow (error vs. sample size)]

To formally identify an elbow, one would need to set the third derivative of the error function to zero. It is easy to demonstrate that, for the proportions error curve, the third derivative of the error function cannot be zero. Therefore, for a proportions error curve, an elbow does not exist.

Below in Figure 5 and in Figure 7, the error curves for both the MAE and MPE error terms have been plotted for the aggregate data, that is, for all five techniques averaged together. In Figures 6 and 8, the error curves for each trade-off technique have been plotted separately. The MAE curves are all similar in shape to one another, as are the MPE curves. Visually, the MAE curves appear to be proportionate to 1/sqrt(n) and the MPE curves appear to be proportionate to 1/n. By regressing the log of the error against the log of sample size it can be confirmed that the aggregate MAE is indeed proportionate to 1/sqrt(n) and the aggregate MPE proportionate to 1/n (coefficients of -0.443 and -0.811, respectively). The third derivative of both 1/sqrt(n) and 1/n can never equal zero. Therefore, neither of these error curves can have an elbow.

[Figure 5: Grand mean MAE vs. sample size]

[Figure 6: Grand mean MAE vs. sample size by trade-off technique (CBC/HB, CVA, HB-Reg, ACA, ACA/HB)]
[Figure 7: Grand mean MPE vs. sample size, with the proportions error curve for comparison]

[Figure 8: Grand mean MPE vs. sample size by trade-off technique (CBC/HB, CVA, HB-Reg, ACA, ACA/HB)]

Using the aggregate MAE and MPE curves as surrogate formulae, tables of error terms as a function of sample size have been constructed below. Given that no elbow exists for these curves, it is left to the researcher, just as it is with proportions curves, to determine the level of error that is acceptable. There is a substantial increase in precision (or decrease in error) when increasing sample size from 30 to 50, both for MAE and MPE. There is also a substantial increase in precision in terms of both MAE and MPE when increasing sample size from 50 to 75. However, the amount of increased precision may become less relevant to many commercial studies when increasing sample size beyond 75 or 100.

Table 8: Estimated MAE by Sample Size

Sample Size   MAE
30            5.8
50            4.6
75            3.9
100           3.5
125           3.2
150           3.0
175           2.7
200           2.5

Table 9: Estimated MPE by Sample Size

Sample Size   MPE
30            6.4
50            3.9
75            2.7
100           2.0
125           1.6
150           1.4
175           1.2
200           1.0

A careful review of Figures 6 and 8 will reveal a pattern of error terms which might suggest that certain trade-off techniques generate lower or higher model error terms than others. This conclusion, at least based on the data presented here, would be false. Each error term is based on total sample utilities computed with a given trade-off technique. Thus, for example, the CVA MPE at a sample size of 100 is determined by taking the CVA-generated mean utilities from the five replicates of the 100 sub-sample and correlating them with the CVA-generated mean utilities for the total sample. Similarly, for HB-Reg, the sub-sample mean utilities are correlated with the total sample mean HB-Reg utilities. Even though the underlying data are exactly the same, MPEs for the CVA sub-samples are based on one set of "holdouts" (total sample CVA-based utilities) while the MPEs for the HB-Reg sub-samples are based on an entirely separate and different set of "holdouts" (total sample HB-Reg-based utilities). Because the reference points for calculating error are not the same, conclusions contrasting the efficiency of the different trade-off techniques cannot be made.

To illustrate how different the total sample models can be, MAEs were calculated comparing the total sample CVA-based models with the total sample HB-Reg-based models for three data sets:

              MAE
Data set 1    7.7
Data set 2    6.5
Data set 3    6.7

These MAEs are larger than most of the MAEs calculated using much smaller sample sizes. Thus, while we cannot compare error terms as calculated here, we can conclude that different trade-off techniques can generate substantially different results.

Having said the above, it is still interesting to note that both the ACA and ACA/HB utilities and models showed remarkable stability at low sample sizes despite the burden of a very large number of parameters to estimate, a much larger number of parameters than any of the other techniques.
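The surrogate formulae behind Tables 8 and 9 can be written down explicitly. The exponents below are the log-log regression coefficients reported above; the constants are calibrated to the n = 100 table entries as an assumption (the fitted intercepts are not reported in the paper), so the output approximates rather than reproduces the tables.

```python
# Minimal sketch of the surrogate error curves behind Tables 8 and 9, using the
# reported log-log exponents (-0.443 for MAE, -0.811 for MPE). The constants are
# calibrated so the curves pass through the n = 100 entries (MAE = 3.5, MPE = 2.0);
# this calibration is an assumption, not a coefficient reported by the author.

def estimated_mae(n):
    return 3.5 * (n / 100) ** -0.443

def estimated_mpe(n):
    return 2.0 * (n / 100) ** -0.811

for n in (30, 50, 75, 100, 125, 150, 175, 200):
    print(f"n = {n:3d}   MAE ~ {estimated_mae(n):.1f}   MPE ~ {estimated_mpe(n):.1f}")
```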
LATENT CLASS MODELS

The above analysis is based upon a data set of 1,950 data points, 975 data points for each error term, MAE and MPE. Excluding ACA data, there were 585 data points for each error term. Latent class models were run on these data to explore the impact on model error of sample size, number of attributes and levels (expressed as number of parameters) and number of tasks. ACA data were excluded from the latent class modeling because of the fundamentally different nature of ACA relative to CVA and CBC.

A variety of model forms were explored, beginning with the simplest, such as error regressed against sample size. The models that yielded the best fit were of the form:

MAE = k * sqrt(P^c / (n^a * T^b))

and

MPE = k * P^c / (n^a * T^b)

where P is the number of parameters, n is sample size, T is number of tasks, and k, c, a and b are coefficients estimated by the model. The k coefficient in the MAE model was not significantly different from 1 and therefore effectively drops out of the equation.

For both the MAE and MPE models, latent class regressions were run for solutions with up to 12 classes. In both cases, the two-class solution proved to have the optimal BIC number. Also in both models, sample size (n) and number of tasks (T) were class independent while number of parameters was class dependent. In both models, all three independent variables were highly significant. It is interesting to note that the most effective covariate was, for the MAE model, trade-off technique (CBC/HB, CVA, HB-Reg). In that model, CBC/HB data points and HB-Reg data points tended to be members of the same class while CVA data points tended to be classified in the other class. For the MPE model, the most effective covariate was data type (CBC, CVA), which would, by definition, group CVA data points and HB-Reg data points together, leaving CBC/HB data points in the other class.

Table 10: MAE 2-Latent Class Model Output

Latent variable model (gamma)
                       Class1     Class2     Wald      p-value
Intercept              -0.0395     0.0395    0.0119    0.91
Covariate: Technique
  CBC/HB                0.9122    -0.9122
  CVA                  -1.8192     1.8192
  HB-Reg                0.9070    -0.9070    7.9449    0.019

Dependent variable (beta): logAdjVal
                       Class1     Class2     Wald        p-value      Wald(=)     p-value
  logAdjVal             1.5988     1.2358    905.8524    2.00E-197    3.77E+01    8.30E-10
  logn                 -0.4166    -0.4166    481.2585    1.10E-106    0.00E+00    .
  logT                 -0.2255    -0.2255     41.0711    1.50E-10     0.00E+00    .
  logP                  0.1471     0.3588     62.4217    2.80E-14     14.2287     0.00016

Class size              0.5751     0.4249

Table 11: MPE 2-Latent Class Model Output

Latent variable model (gamma)
                       Class1     Class2     Wald      p-value
Intercept               1.0976    -1.0976    4.7383    0.03
Covariate: DataType
  CBC                   1.2901    -1.2901
  CVA                  -1.2901     1.2901    6.6741    0.0098

Dependent variable (beta): logAdjVal
                       Class1     Class2     Wald        p-value      Wald(=)     p-value
  logAdjVal             0.7849     2.9455    575.3608    1.20E-125    252.1947    8.60E-57
  logP                  2.0587     0.1556    499.8580    2.90E-109    186.1446    2.20E-42
  logT                 -0.7422    -0.7422     79.3514    5.20E-19     0.00E+00    .
  logn                 -0.9422    -0.9422    467.1816    1.30E-103    0.00E+00    .

Class size              0.6005     0.3995

CONCLUSIONS

Minimum sample size must be determined by the individual researcher, just as is the case with simple proportions tests. There is no obvious "elbow" in the error curve which would dictate a natural minimum sample size. However, using the aggregate error tables as a guide, sample sizes of approximately 75 to 100 appear to be sufficient to provide reasonably accurate models. Larger sample sizes do not provide a substantial improvement in model error.
In fact, sample sizes as low as 30 provided larger but not unreasonable error terms, suggesting that, in some instances, small sample sizes may be appropriate. These data do not suggest that sample size needs to be larger for any trade-off technique relative to the others. Specifically, HB methods do not appear to require greater sample size than traditional methods.

In addition to sample size, both the number of tasks and the number of parameters being estimated play a significant role in the size of model error. An obvious conclusion from this finding is that when circumstances dictate the use of small sample sizes, the negative effects on model precision can be somewhat offset by either increasing the number of tasks and/or decreasing the number of parameters estimated. These results appear consistent for both error terms calculated for this study: MAE and MPE.

DISCUSSION

There are many aspects of this study which could be improved in future research. The inclusion of more data points would provide better estimates of the shape of the error curve. More replicates at lower sample sizes would provide more stability. MSE (Mean Squared Error) could be included as an additional error term that may prove to be more sensitive than MAE.

The most serious limitation of this paper is the absence of objective standards, that is, holdout cards. Ideally, holdout cards and also attributes and levels would be identical across trade-off techniques. This would require custom designed studies for the purpose of sample size research. An alternative to funding fieldwork for a non-commercial study would be to construct synthetic data sets based on the means and co-variances of existing commercial data sets. If the synthetic data sets were constructed, the sample bias problem would be eliminated, a variety of sample sizes could be independently drawn, and attribute co-linearity, which commonly exists in commercial data sets, would be maintained.

There are other factors that may affect model error. The number of tasks may have a nonlinear relationship to model error. Increasing the number of tasks increases the amount of information available to estimate the model. An excessive number of tasks, however, may increase respondent fatigue to the point of offsetting the theoretical gain in information. Many aspects of measurement error, such as method of data collection (online vs. telephone vs. mall intercept), use of physical or visual exhibits, interview length, level of respondent interest, etc., may all play a role in model error that could affect the ultimate decision regarding sample size.

The ultimate question that remains unanswered is, what is the mathematics behind model error? If a formula could be developed, as exists for proportions, researchers could input various study parameters, such as number of tasks, number of parameters, sample size, etc., and chart the error term by sample size. They could then make an informed decision, weighing both the technical and managerial aspects, and select the sample size most appropriate for that situation.

REFERENCES

Johnson, Richard (1996), "Getting the Most From CBC - Part 1," Sawtooth Software, Inc., Sequim, WA.

Johnson, Richard M. and Bryan K. Orme (1996), "How Many Questions Should You Ask In Choice-Based Conjoint Studies?" 1996 Advanced Research Techniques Forum Proceedings, American Marketing Association.
Moskowitz, Howard (2000), “Stability of the Mean Utilities In Hybrid Conjoint Measurement,” 2000 Advanced Research Techniques Forum Poster Presentation, American Marketing Association. Orme, Bryan (1998), “Sample Size Issues for Conjoint Analysis Studies,” Sawtooth Software, Inc., Sequim, WA 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 149 150 2001 Sawtooth Software Conference Proceedings: Sequim, WA. THE EFFECTS OF DISAGGREGATION WITH PARTIAL PROFILE CHOICE EXPERIMENTS1 Jon Pinnell President & COO MarketVision Research Lisa Fridley Research Manager Marketing Sciences Group MarketVision Research ABSTRACT Recently, hierarchical Bayes has been shown to produce improvements in predictive validity over aggregate logit models. However, some researchers have observed that hierarchical Bayes when used with partial profile choice tasks may decrease predictive validity relative to an aggregate logit analysis. The authors explore the internal predictive validity of partial profile choice tasks with disaggregation, comparing aggregate logit and hierarchical Bayes on several partial profile datasets. THE BENEFITS OF DISAGGREGATION Researchers have long discussed the benefits of considering individual differences. However, many techniques commonly used fail to allow responses to differ across respondents. Two relevant examples include: Market Response Models Historically, market response modeling has been conducted at the aggregate level, possibly by market or store. Hanssens, Parsons and Schultz (1990) comment, “In practice, model building efforts appear to exclude considerations about the level of aggregation, probably because of the constraints surrounding availability of data.” If individual differences are considered in market response modeling, they are most often included via entity specific intercept terms rather than entity specific slope (elasticity) terms. In the 1970s, Markov techniques had been suggested that allow for individual specific elasticity parameters (Rosenberg, 1973), but these methods were not widely adopted. Customer Satisfaction Research Derived importance, as is common with satisfaction, typically produces a regression coefficient for each attribute (or driver). These coefficients are commonly interpreted as 1 The authors wish to thank Andrew Elder of Momentum Research Group, Chris Goglia of American Power Conversion, and Tom Pilon of TomPilon.com, for each kindly sharing datasets for this research. We also wish to thank Ying Yuan of MarketVision Research for producing the simulated choice results. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 151 the importance of that attribute, or put another way, its relative influence on overall satisfaction. Crouch and Pinnell (2001) show that accounting for respondent heterogeneity (in this example using latent class analysis) produces derived importances with substantially improved explanatory power. Researchers using ratings-based conjoint would commonly promote the ability to derive individual level utilities as a key benefit of such techniques. As the research community’s focus shifted towards discrete choice modeling, the goal of individual level utilities was begrudgingly reduced to a nicety. Some researchers attempted a wide variety of solutions to maintain individual level estimation. For example, some researchers have suggested dual conjoint (Huisman) and reweighting (Pinnell, 1994) to leverage the benefit of ratings and choice based techniques. 
Other advances have provided methods to disaggregate choice data. These include (all from Sawtooth Software): k-logit, latent class, ICE, and hierarchical Bayes. Hierarchical Bayes appears to offer the greatest opportunity.

Hierarchical Bayes with Discrete Choice

One of the current authors has presented a meta-analysis of commercial and experimental research studies comparing the ability to predict respondents' choices when those choices were analyzed with aggregate logit compared to hierarchical Bayes (Pinnell, 2000). The purpose of the previous work was to compare improvement of predictive ability when implementing utility balance (UB) relative to improvement of predictive ability when implementing hierarchical Bayes (HB). The discussion of UB is not relevant to the current work, but the findings related to HB certainly are. The following table summarizes the hit rates of several studies, comparing the hit rates from aggregate logit to the hit rates using hierarchical Bayes.

               Study One   Study Two   Study Three   Study Four   Study Five
Agg. Logit       75.8%       24.8%       60.5%         61.2%        59.2%
Hier. Bayes      99.5%       79.5%       62.6%         79.3%        78.8%
Improvement      23.8%       54.7%        2.1%         18.1%        19.6%

The five studies shown above show substantial improvement in hit rates. However, the paper continued to report on a sixth study. The findings from the sixth study show a very different result from HB.

               Agg. Logit   Hier. Bayes   Improvement
Study Six        71.9%        68.1%         -3.8%

The conclusion still is that HB is generally beneficial. However, this one study demonstrates HB can be deleterious, which is troubling. It is not surprising that HB could produce a result that was not clearly superior to aggregate logit. But in the face of heterogeneity and with enough information, we expected HB to provide beneficial results.

There are several possible explanations that could explain this anomalous result. They include:

• Not enough heterogeneity
• Not enough information in the choice tasks
• Poor hold-out tasks
• Confounding effects with UB experiments
• Partial profile nature of tasks

At the time, the last explanation most resonated with our beliefs, but there was little scientific basis for that claim. It is worth noting that the sixth study was the only one of the set that was not full profile. Is there some reason that HB with partial profile choice tasks will behave differently than HB with full profile choice tasks?

What is Partial Profile?

Partial profile choice designs have emerged as methods of dealing with choice-based studies that include a large number of attributes. In a partial profile choice design, respondents are shown, at one time, a subset of the full set of attributes. They are then asked to indicate their preferred concept based on this subset of attributes. This task is then repeated with the subset of attributes changing each time. An illustrative partial profile task is shown below:

Full Profile Task              Partial Profile Task
   1    2    3                    1    2    3
  A1   A2   A3                   A1   A2   A3
  B4   B3   B1                   D2   D1   D2
  C3   C1   C2                   F3   F1   F2
  D2   D1   D2
  E1   E2   E4
  F3   F1   F2
  G1   G2   G2

Full profile designs, where respondents are exposed to the full set of attributes, work well when the number of attributes is relatively small. However, as the number of attributes becomes large, some researchers are concerned that the ability of respondents to provide meaningful data in a full profile choice task decreases. One particular concern is that respondents may oversimplify the task by focusing on only a few of the attributes.
Thus, partial profile choice tasks are designed to be easier for the respondent. Like full profile choice designs, partial profile choice designs allow the researcher to produce utilities either at the aggregate or disaggregate level. Chrzan (1999) has presented evidence that partial profile choice tasks are easier for respondents, and produce utilities with less error, relative to full profile choice tasks.

There is substantial precedent for partial profile tasks. ACA (Adaptive Conjoint Analysis) is a ratings-based conjoint methodology that customizes each task based on each respondent's prior utilities. ACA also presents tasks that are partial profile. There have been suggestions from some researchers that the partial profile tasks in ACA dampen the utility measures (Pinnell, 1994). However, is there a compelling reason to expect that disaggregating partial profile tasks would behave differently than full profile tasks?

DISAGGREGATING PARTIAL PROFILE TASKS

Previous research has investigated the effects of disaggregating partial profile choice tasks. The interested reader is referred to the following: Lenk, DeSarbo, Green, and Young; Chrzan; Huber; and Brazell et al. These studies, taken together, show mixed results and fail to provide clear direction regarding the use of HB with partial profile choice tasks. Given the popularity of partial profile and HB, the current work conducts a meta-analysis of several partial profile choice datasets and compares the predictive validity of aggregate logit to that of hierarchical Bayes.

Empirical Results

Empirical Results: Method

We report on the findings of nine different studies. For each study we will report the following design elements:

• Number of respondents
• Number of attributes
• Number of attributes shown per task
• Number of alternatives per task
• Number of tasks
• Number of parameters estimated

For this review we are conducting purely post hoc analysis, so we don't have consistent holdout tasks. Still, we are required to use some criterion to evaluate the effectiveness of HB with partial profile choice tasks. The criterion we use is hit rates. Without consistent hold-out tasks across all nine studies, even hit rates are difficult to produce. Rather, for each study we held out one task and used all other tasks to estimate a set of utilities using aggregate logit (AL) and hierarchical Bayes. These two sets of utilities were used to calculate hit rates for the one held-out task. This process of holding out one task, estimating two sets of utilities (AL and HB), and calculating hit rates was repeated several times for each study (for a total of eight times).

We must point out that we are dealing with randomized choice tasks, so we must use hits as our criterion. We would have preferred to use errors in share predictions as our criterion, but that wasn't feasible. It is also important to note that the studies analyzed here are a compilation from multiple researchers, each likely with differing design considerations and expectations for the use of the data.

For each of the following studies, we report the hit rates using utilities estimated from aggregate logit and HB. We show the difference in the hit rates and the standard error of the difference. Finally, we show a t-value of the difference testing the null that it is zero. The standard error and t-value take into account the correlated nature of our paired observations.
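A minimal sketch of this paired comparison follows; the per-respondent hit rates are hypothetical placeholders, not values from any of the nine studies.

```python
# Minimal sketch of the paired comparison reported for each study: per-respondent
# hit rates under aggregate logit (AL) and HB, the mean difference, its standard
# error (which respects the pairing), and the t-ratio against zero.
import math

def paired_t(al_hits, hb_hits):
    diffs = [h - a for a, h in zip(al_hits, hb_hits)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(var / n)
    return mean_diff, std_err, mean_diff / std_err

# Hypothetical per-respondent hit rates (proportion of held-out tasks predicted):
al = [0.50, 0.625, 0.75, 0.50, 0.375, 0.625, 0.75, 0.50]
hb = [0.625, 0.625, 0.875, 0.50, 0.50, 0.75, 0.625, 0.625]

mean_diff, std_err, t_ratio = paired_t(al, hb)
print(f"difference {mean_diff:+.3f}, std. error {std_err:.4f}, t-ratio {t_ratio:.2f}")
```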
The following studies are arbitrarily ordered from the smallest sample size to the largest.

Study 1

Study 1 Design:
  Number of respondents             75
  Number of attributes              11
  Number of attributes shown         5
  Number of tasks                    6
  Number of alternatives             3
  Number of parameters estimated    33

Study 1 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  64.1%                59.6%         -4.5%        .0147            -3.05

Study 2

Study 2 Design:
  Number of respondents            133
  Number of attributes               7
  Number of attributes shown         4
  Number of tasks                   15
  Number of alternatives             4
  Number of parameters estimated    22

Study 2 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  47.6%                51.7%         4.0%         .0170            2.38

Study 3

Study 3 Design:
  Number of respondents            166
  Number of attributes              18
  Number of attributes shown         5
  Number of tasks                   20
  Number of alternatives             4
  Number of parameters estimated    18

Study 3 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  56.0%                56.5%         0.6%         .0105            .54

Study 4

Study 4 Design:
  Number of respondents            167
  Number of attributes               9
  Number of attributes shown         5
  Number of tasks                   15
  Number of alternatives             4
  Number of parameters estimated    29

Study 4 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  52.2%                53.6%         1.4%         .0170            .82

Study 5

Study 5 Design:
  Number of respondents            187
  Number of attributes               7
  Number of attributes shown         5
  Number of tasks                   15
  Number of alternatives             4
  Number of parameters estimated    23

Study 5 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  46.6%                57.6%         11.0%        .0148            7.41

Study 6

Study 6 Design:
  Number of respondents            218
  Number of attributes              12
  Number of attributes shown         4
  Number of tasks                   10
  Number of alternatives             3
  Number of parameters estimated    27

Study 6 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  73.4%                66.5%         -6.9%        .0095            -7.24

Study 7

Study 7 Design:
  Number of respondents            611
  Number of attributes              17
  Number of attributes shown         5
  Number of tasks                   17
  Number of alternatives             3
  Number of parameters estimated    39

Study 7 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  59.2%                57.0%         -2.1%        .0060            -3.58

Study 8

Study 8 Design:
  Number of respondents            791
  Number of attributes               9
  Number of attributes shown         5
  Number of tasks                   13
  Number of alternatives             5
  Number of parameters estimated    35

Study 8 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  32.8%                48.6%         15.8%        .0106            14.84

Study 9

Study 9 Design:
  Number of respondents           1699
  Number of attributes              30
  Number of attributes shown         5
  Number of tasks                   16
  Number of alternatives             3
  Number of parameters estimated    49

Study 9 Findings:
  Hit Rate Aggregate   Hit Rate HB   Difference   Standard Error   t-ratio
  68.0%                65.7%         -2.3%        .0040            -5.67
Bayes 73.4% 66.5% Diff Stderr t-ratio -6.9% 0.0095 -7.24 HB DOES WORSE HB DOESN’T HELP HB DOES BETTER 64.1% 68.0% 59.2% 56.0% 52.2% 47.6% 46.6% 32.8% 59.6% 65.7% 57.0% 56.5% 53.6% 51.7% 57.6% 48.6% -4.5% 0.0147 -3.05 -2.3% 0.0040 -5.67 -2.1% 0.0060 -3.58 0.6% 0.0105 0.54 1.4% 0.0170 0.82 4.0% 0.0170 2.38 11.0% 0.0148 7.41 15.8% 0.0106 14.84 Of the nine studies in our analysis, we see that HB is detrimental to predictive validity in four of the nine cases, is not beneficial in another two of the nine cases and for only three of the nine cases does HB show an improvement over aggregate logit for individual prediction. Based on the results from Pinnell (2000), we aren’t terribly surprised by these findings, but we do consider them troubling to the use of HB with partial profile choice tasks. Next, we explore a number of elements that might explain the findings. Possible Explanations There are several possible explanations to the previous findings. We explore the following three possibilities: • • • Is there enough heterogeneity for disaggregation to be beneficial? Do specific design considerations limit HB’s ability? Are partial profile choice tasks susceptible to overfitting? Each is discussed below. Heterogeneity As discussed above, in the absence of heterogeneity, one would not expect HB to outperform an aggregate model. However, it is unlikely that preferences are homogeneous in most categories. One might expect for any given study, the higher the hit rate from aggregate logit, the more homogeneous the population under investigation. It isn’t clear this is true in a metaanalysis such as this, for the following reasons. First, the magnitude of hit rates is affected by the amount of dominance in the hold-out tasks. As the level of dominance increases, so will the hit rates. 158 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Second, hit rates from hold-out tasks with two alternatives will have much higher expected hit rates than hold-outs with more alternatives. For example, a hold-out hit rate of 52% from pairs would not be impressive (the level of chance is 50%). However, if the hold-out tasks were of five alternatives, a hold-out hit rate of 40% would be meaningful (twice the level of chance, which would be 20%). Not unrelated to the points outlined above, some might be concerned that there is a headroom problem. Recall, however, that in the full profile case reported previously and summarized above, HB was able to improve on hit rates in the mid 70 percent range. To further explore whether a lack of heterogeneity might explain the findings of HB, we explore one study in more detail. Specifically, we explore the study with the best hit rate from aggregate logit. Recall this study had the following design characteristics and results. Number of respondents Number of attributes Number of attributes shown Number of tasks Number of alternatives Number of parameters estimated Hit Rate Aggregate 73.4% Hit Rate HB 66.5% Difference -6.9% 218 12 4 10 3 27 Standard Error .0095 t-ratio -7.24 To investigate the existence of heterogeneity, we needn’t rely on HB alone. Rather, for this study we had additional information2 about each respondent outside the choice questions. From this additional information we are able to form three segments of respondents on the attributes that were also in the choice study. 
The aggregate model produced the following model fit: -2LL = 3480 By accounting for scale differences (Swait and Louviere) between the three segments, the best improvement in fit we could accomplish was: -2LL = 3448 2 Effectively, we had prior estimates of each individual’s utilities, as are collected in an ACA interview. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 159 However, by evaluating the model fit of three independent logit models, we can evaluate if there are heterogeneous preferences. Model fits expressed as –2LL are additive, so the model fit of the three independent models combined is: -2LL = 3364 This suggests (p < .01) there is clearly heterogeneity between respondents in the dataset. We conclude that the HB with partial profile didn’t produce a deleterious result solely due to a lack of heterogeneity. It might be the case that there just wasn’t enough information with the partial profile choice tasks. The amount of information available from choice tasks is based on a number of design considerations. DESIGN CONSIDERATIONS We explored the relationship between several design characteristics and the t-ratio of the improvement due to HB. The design characteristics we considered include the following. • • • • • • • Number of respondents Number of attributes in the study Number of attributes shown in each task Number of alternatives per task Number of tasks Total number of parameters estimated Total number of observations3 per parameter estimated We evaluate the relationship between each design characteristic and the t-statistic of the improvement due to HB based on Kendall’s tau (τ). The Kendall’s tau and probability reported are the simple averages across nine samples drawn from our nine studies, with each sample draw excluding one study. The results (Kendall’s tau) are shown in the following table. Characteristic # of Respondents # of Attributes # of Attributes shown # of Alternatives/Task # of Tasks # of Parameters # of Observations/Parameter τ p value -0.056 -0.572 0.178 0.817 -0.087 -0.222 0.457 0.73 0.06 0.48 0.01 0.76 0.48 0.13 The one significant design characteristic we believe deserves special attention is the finding that as the number of alternatives per task increases, the performance of HB improves (relative to 3 Observations is defined as follows: (# of attributes shown X # of alternatives per task X # of tasks). 160 2001 Sawtooth Software Conference Proceedings: Sequim, WA. aggregate logit). One hypothesis stated above is that, even with respondent heterogeneity, the ability of HB might be limited based on the amount of information available in the choice tasks. Previous research has shown the statistical benefit of increasing the number of alternatives per task. (Pinnell and Englert, 1997). Besides number of alternatives per task, none of the findings related to design characteristics are significant (α = 0.05). Finally, we explore if partial profile tasks might be more susceptible to overfitting, and if HB might exacerbate that result. Are We Overfitting with PP, HB with PP To explore whether we might be overfitting with HB on partial profile tasks, we produced synthetic data. Using Monte Carlo simulations we produce full profile and partial profile tasks and we simulate respondents’ choices with varying amounts of error and heterogeneity. The design of the data included 10 attributes, 5 of which were shown in the partial profile tasks. 
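As an illustration of the kind of data-generating step involved, a minimal sketch might look like the following; the parameter settings marked as hypothetical are not from the study, and the actual simulation design is described in the text.

    import numpy as np

    rng = np.random.default_rng(1)
    n_resp, n_attr, n_shown, n_alts = 600, 10, 5, 3    # alternatives per task is a hypothetical setting
    pop_mean = rng.normal(0.0, 1.0, n_attr)            # population part-worths (hypothetical)
    heterogeneity = 0.5                                # spread of individual part-worths (varied in the study)
    betas = pop_mean + heterogeneity * rng.normal(size=(n_resp, n_attr))

    def simulate_partial_profile_task(beta, error_scale=1.0):
        shown = rng.choice(n_attr, size=n_shown, replace=False)        # attributes shown in this task
        x = np.zeros((n_alts, n_attr))
        x[:, shown] = rng.choice([-1.0, 0.0, 1.0], size=(n_alts, n_shown))
        utility = x @ beta + error_scale * rng.gumbel(size=n_alts)     # Gumbel error gives logit-consistent choices
        return x, int(np.argmax(utility))                              # design block and the simulated choice

Full profile tasks are generated in the same way, except that all ten attributes appear in every task.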
In conducting these simulated choices, we sought to time equalize the full profile tasks relative to the partial profile tasks so 4 full profile tasks and 10 partial profile tasks were simulated for 600 respondents. All tasks relied on randomized designs. We explore utilities derived from three sources: • • • Aggregate logit estimation from full profile choice tasks Aggregate logit estimation from partial profile choice tasks Hierarchical Bayes estimation from partial profile choice tasks Each approach’s ability to recover known parameters can be expressed via a correlation between the derived utilities and the known utilities. These are shown in the following table: Pearson r Agg. Logit, Full Profile Agg. Logit, Partial Profile Hier. Bayes, Partial Profile 0.86 0.84 0.77 Correlation with Known Utilities When we compare the full profile utilities (from simulated utilities) to the partial profile utilities we see that the utilities from partial profile tasks have a much larger scale than the full profile utilities, as shown below. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 161 Comparison of Utilities Full Profile (Agg. Logit) vs. Partial Profile (Agg. Logit) 4.00 Partia l Prof ile y=x 3.00 2.00 1.00 Full Profile/ Time Equalized -4.00 -2.00 0.00 0.00 -1.00 2.00 4.00 -2.00 -3.00 -4.00 A casual inspection suggests that the utilities derived from the partial profile choice tasks have a scale nearly twice those from full profile. It might not be surprising that our synthetic partial profile utilities have a larger scale than full profile utilities. This finding confirms previous research (Chrzan, 1999) in which utilities estimated from partial profile tasks were shown to have larger scale than utilities estimated from full profile tasks. However, our understanding of previous research is that this improvement was always attributed to partial profile based solely on an easier respondent task. This finding suggests the difference between full profile and partial profile might be systemic and not limited to differential information processing in humans. It raises the question: can too much scale be a bad thing? Traditionally, higher scale has been represented as a good thing suggesting a decrease in error. Could higher scale represent overfitting? No amount of analysis on utilities can be as informative as exploring how well different sets of utilities can be used for prediction. For the meta-analysis above we used hit rates to gauge effectiveness in individual level prediction. We commented that we would have preferred to measure errors in aggregate share prediction. Given the synthetic nature of this final data set, we were able to simulate choices to fixed hold-out tasks. We report on the errors in prediction when predicting shares of hold-out choices. Specifically, we show the mean squared errors, averaged across multiple simulations. 162 2001 Sawtooth Software Conference Proceedings: Sequim, WA. The errors under the three scenarios are summarized below: Errors Predicting Hold-Out Shares Error Agg. Logit, Full Profile Agg. Logit, Partial Profile Hier. Bayes, Partial Profile 7.17% 8.44% 11.52% These findings would suggest that partial profile with aggregate logit or with HB has more error than full profile aggregate logit, and about 60% more for the partial profile with HB scenario. However, we know that the partial profile utilities had a larger scale parameter, and this difference might impact their ability to produce accurate share estimates. 
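The mechanics are easy to see in a simple logit share simulator: multiplying a set of utilities by a scale factor greater than one sharpens the predicted shares, and dividing the larger-scaled utilities back down ("attenuating") flattens them again. A brief sketch with hypothetical utilities:

    import numpy as np

    def logit_shares(utilities, scale=1.0):
        # predicted shares for one hold-out task under a multinomial logit rule
        v = np.exp(scale * np.asarray(utilities, float))
        return v / v.sum()

    u = np.array([0.8, 0.2, -1.0])            # hypothetical hold-out utilities
    print(logit_shares(u))                     # shares at the original scale
    print(logit_shares(u, scale=2.0))          # roughly the partial profile case: spikier shares
    print(logit_shares(2.0 * u, scale=0.5))    # dividing out the larger scale recovers the original shares

If the larger scale of the partial profile utilities were the whole story, attenuating for it should bring the share errors back into line with full profile; the next set of results shows that it does not.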
To account for this scale difference, we reanalyzed the share predictions from each of the three sets of utilities, this time attenuating for scale differences. The errors in share predictions, after attenuating for scale differences, are summarized below: Errors Predicting Hold-Out Shares After Attenuating for Scale Differences Error Agg. Logit, Full Profile Agg. Logit, Partial Profile Hier. Bayes, Partial Profile 0.69% 3.53% 8.82% Attenuating for scale fails to improve the performance of partial profile results. In fact, relative to full profile aggregate logit4, partial profile’s performance deteriorates. 4 As well as full profile with HB estimation, though this finding is beyond the scope of the current work. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 163 CONCLUSIONS We had previously found HB to be beneficial with full profile choice tasks, and often impressively so. Through this investigation, and others, we have found nothing to cause us to question this conclusion. However, in the past we have seen unreliable results with partial profile choice tasks and HB estimation of individual part-worths. In our meta-analysis of nine commercial datasets, we find that HB, when used with partial profile choice tasks improves prediction in individual level hit rates relative to aggregate logit in only three of the nine studies examined. At the same time, we find that in four of the nine studies HB used to estimate individual part-worths from partial profile choice tasks produced significantly inferior hit rates relative to aggregate logit. We also explore the impact of using HB with partial profile tasks through synthetic data and Monte Carlo simulations. We find that utilities derived from partial profile tasks with aggregate logit had nearly twice the scale of utilities derived from full profile tasks with aggregate logit. This finding is not inconsistent with other research, but our conclusion is different than other research. As far as we know, all previous research has concluded that the higher scale of partial profile tasks was a result of a simplified respondent task. Given that our synthetic data produces the same findings without respondents, we conclude the scale is not a purely respondent based phenomenon. Rather, we hypothesize that partial profile tasks are susceptible to overfitting. We explore errors in share prediction of hold-out tasks. Even attenuating for scale differences, we show that both partial profile with aggregate logit and individual part-worths from partial profile with HB have much higher error than full profile with aggregate logit. 164 2001 Sawtooth Software Conference Proceedings: Sequim, WA. REFERENCES Brazell, Jeff, William Moore, Christopher Diener, Pierre Uldry, and Valerie Severin (2001), “Understanding the Dynamics of Partial Profile Application in Choice Experiments,” AMA Advanced Research Techniques Forum; Amelia Island, FL. Chrzan, Keith (1999), “Full versus Partial Profile Choice Experiments: Aggregate and Disaggregate Comparisons,” Sawtooth Software Conference; San Diego, CA. Crouch, Brad and Jon Pinnell (2001), “Not All Customers Are Created Equally: Using Latent Class Analysis To Identify Individual Differences,” Working Paper, MarketVision Research; Cincinnati, OH. Hanssens, Dominique, Leonard Parsons, and Randall Schultz (1990), Market Response Models: Econometric and Time Series Analysis. Kluwer Academic Publishers: Boston. 
Huber, Joel (2000), “Projecting Market Behavior for Complex Choice Decisions,” Sawtooth Software Conference, Hilton Head, SC. Huisman, Dirk (1992), “Price-Sensitivity Measurement of Multi-Attribute Products,” Sawtooth Software Conference; Sun Valley, ID. Lenk, Peter, Wayne DeSarbo, Paul Green and Martin Young (1996), “Hieararchical Bayes Conjoint Analysis: Recovery of Partworth Heterogeneity from Reduced Experimental Designs,” Marketing Science, 15, 173-191. Pinnell, Jon (1994), “Multistage Conjoint Methods to Measure Price Sensitivity,” AMA Advanced Research Techniques Forum; Beaver Creek, CO. Pinnell, Jon and Sherry Englert (1997), “Number of Choice Alternatives in Discrete Choice Modeling,” Sawtooth Software Conference; Seattle, WA. Pinnell, Jon (2000), “Customized Choice Designs: Incorporating Prior Knowledge and Utility Balance in Discrete Choice Experiments,” Sawtooth Software Conference; Hilton Head, SC. Rosenberg, Barr (1973), “The Analysis of a Cross-Section of Time Series by Stochastically Convergent Regression,” Annals of Economic and Social Measurement, 399-428. Swait, Joffre and Jordan Louviere (1993), “The Role of the Scale Parameter in the Estimation and Comparison of Multinomial Logit Models.” Journal of Marketing Research, Vol. 30 (August), 305-14. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 165 166 2001 Sawtooth Software Conference Proceedings: Sequim, WA. ONE SIZE FITS ALL OR CUSTOM TAILORED: WHICH HB FITS BETTER? Keith Sentis & Lihua Li Pathfinder Strategies1 INTRODUCTION As most of you know from either your own experience with Hierarchical Bayes Analysis (HB) or from reports by colleagues, this relatively new analytic method offers better solutions to a variety of marketing problems. For example, HB analyses yield equivalent predictive accuracy with shorter questionnaires when estimating conjoint part-worths (Huber, Arora & Johnson, 1998). HB also gives us estimates of individual utilities that perform well where before we would have had to settle for aggregate analyses (Allenby and Ginter, 1995; Lenk, DeSarbo, Green & Young, 1996). HB methods achieve an “analytical alchemy” by producing information where there is very little data – the research equivalent of turning lead into gold. This is accomplished by taking advantage of recently developed analytical tools (the Gibbs sampler) and advances in computing speed to estimate a complex two-level model of individual choice behavior. In the upper level of the model, HB makes assumptions about the distributions of respondents’ vectors of part-worths. At the lower level of the model, HB assumes a logit model for each individual. The analytical alchemy results from using information from the upper level to assist with the fitting of the lower level model. If a given respondent’s choices are well estimated from his own data, the estimates of his part-worths are derived primarily from his own data in the lower level and depend very little on the population distribution in the upper level. In contrast, if the respondent’s choices are poorly estimated from his own data, then his part-worths are derived more from the distributions in the upper level and less from his individual data in the lower level. Essentially, HB “borrows” information from the entire sample to produce reasonable estimates for a given respondent, even when the number of choices made by the respondent is insufficient for individual analysis. 
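Written out, the two-level structure is as follows (a standard sketch of the hierarchy, not the exact specification of any particular software package). For respondent i with part-worth vector beta_i:

    Upper level:   beta_i ~ Normal(b, D)        b = population mean part-worths, D = covariance matrix
    Lower level:   Pr(respondent i chooses alternative k) = exp(x_k' beta_i) / sum_j exp(x_j' beta_i)

The estimate of beta_i is, loosely speaking, a precision-weighted compromise between the respondent's own choices (the lower level) and the population distribution (the upper level), which is the "borrowing" just described.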
This process of “borrowing” information from the entire sample to assist in fitting individual level data requires considerable computational “grunt” and is a potential barrier to widespread use of HB methods. However, the rapid increase in computer speed and some of our own work (Sentis & Li, 2000) that identified economies achievable in the analysis, have made HB analysis a viable tool for the practitioner. The focus of our paper today is a particular aspect of how HB “borrows” information to fit individual level data. I mentioned a moment ago that the upper level model in HB makes some assumptions about the distribution of vectors of part-worths in the population. In the simplest case, the upper level model assumes that all respondents come from the same population distribution. More complex upper level models make further assumptions about the nature of the population. For example, the upper level model may allow for gender differences in choice behavior. Of course, these more complex upper level models require more parameters and additional computational grunt. 1 The authors thank Rich Johnson for helpful comments. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 167 In popular implementations of HB for estimating conjoint part-worths such as the Sawtooth HB modules, the upper level model is simple. All respondents’ vectors of part-worths are assumed to be normally distributed. That is, all respondents’ choices are assumed to come from a single population of choice behavior. On the face of it, this assumption runs counter to much work done in market segmentation. Indeed, the fundamental premise of market segmentation is that different segments of respondents have different requirements which are manifest as different patterns of choice behavior. Ostensibly, this demand heterogeneity enables differentiated product offerings, niche strategies and effective target marketing efforts. This view was first posited by Smith (1956) who defined market segmentation as making product decisions by studying and characterizing the diversity of wants that individuals bring to a market. Our paper examines what happens when HB analyses are allowed to “borrow” information from more relevant subpopulations. The idea was to attempt to improve our predictions by “borrowing” information from a more appropriate segment of respondents rather than borrowing from the entire sample. Consider this analogy. If you want to buy a new suit, there are two ways to proceed. You can shop for an off-the-rack suit and hope or assume that your particular body shape fits within the distribution of body shapes in the population. Alternatively, you can have a suit customtailored to your exact shape. Custom tailoring will almost always yield a better look than the “one-size-fits-all” alternative. This custom tailoring yields a better fit but is more costly. Similarly, in our current project, we explored whether custom tailoring HB utilities within segments yields a better fit. That is, we explored whether better fitting models can be achieved by having the analysis “borrow” information from a more appropriate base than the entire sample — namely segments of the sample. We do not have access to HB software that allows complex upper level models. Instead, we used the Sawtooth HB CBC module to explore the impact on predictive accuracy from first dividing respondents into groups with presumably similar choice patterns and then estimating the utilities separately for each group. 
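Formally, estimating HB separately within segments amounts to replacing the single upper-level distribution with a segment-specific one (a sketch of the idea rather than of any particular package's model):

    beta_i ~ Normal(b_s(i), D_s(i)),   where s(i) is the segment to which respondent i belongs

so that each respondent "borrows" only from the other members of his or her own segment.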
We compared the predictive accuracy of HB utilities derived from the entire sample to those derived from within a priori segments and also from within latent segments. To customize the HB utilities in this way requires more effort and therefore increases the cost of the analysis. Keeping with our sartorial analogy, our paper poses the following question: These custom-tailored HB analyses have a higher price tag but do they yield a nicer fit? 168 2001 Sawtooth Software Conference Proceedings: Sequim, WA. APPROACH We took a simple-minded approach to this question. First, we computed HB utilities using the entire sample. Actually, we computed three separate sets of utilities to reduce any random jitters in the results. Then we calculated the hit rates for hold out tasks using the three sets of utilities and we averaged the three hit rates. Next, we divided the sample into segments –either a priori segments or latent segments – and computed HB utilities within each segment. Then we calculated the hit rates for the same holdouts using the within-segment utilities. We computed three sets of HB utilities within each segment and averaged the hit rates as we did for the entire sample analyses. Then we compared the hit rates based on the total sample utilities with the hit rates based on the within-segment utilities. Here is a summary of our approach: Step 1: Compute HB utilities using entire sample • 3 separate sets Step 2: Calculate hit rates for hold outs using three separate sets of utilities • average the three hit rates Step 3: Divide sample into segments (a priori or latent) and compute HB utilities within each segment • 3 separate sets in each segment Step 4: Calculate hit rates for hold outs within each segment using three separate sets of utilities • average the three hit rates within each segment Step 5: Compare the mean hit rates Results The first dataset we looked at had these characteristics: • business to business study • 280 respondents • tasks = 16 o 4 concepts plus NONE • holdouts = 2 • attributes = 10 • partial profile design This study focused on a range of farm enterprises that were engaged in quite different farm activities. Some of these farms produced fine Merino wool and some produced fine chardonnay grapes. The range of farm enterprises broke into three broad industry sectors and we used these industry sectors as a priori segments. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 169 The results of our analyses are shown on this graph. Each of the points is the mean of the hit rates from three separate sets of utilities. It would be safe to summarize this slide as “Custom tailoring does not yield dramatically better fits.” Industry Sectors Hit Rates 0.6 Sector 1 Sector 3 0.5 Sector 2 0.4 Total Sample Within Segment Total Sample Within Segment Total Sample Within Segment We were somewhat surprised by this and decided to explore what happens to the fit when latent segments are defined on the basis of different choice patterns. We examined three latent segments that we had identified within this same dataset. Three segments were defined using KMEANS clustering of the HB utilities from the total sample. These three segments comprised 40%, 34% and 26% of the farming enterprises and had all of the characteristics that we like to see when we conduct segmentation projects. They looked different, they made sense, they were statistically different and most importantly, the client gave us a big head nod. 
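As an aside, Steps 1 through 5 above can be written schematically as follows; estimate_hb and hit_rate are placeholders for the actual estimation and hold-out validation routines (here, the Sawtooth HB CBC module plus a hold-out simulator), so this is a sketch of the bookkeeping only.

    def mean_hit_rate(respondent_ids, estimate_hb, hit_rate, runs=3):
        # average the hold-out hit rate over three independent HB estimations
        return sum(hit_rate(estimate_hb(respondent_ids)) for _ in range(runs)) / runs

    def compare_fits(all_ids, segments, estimate_hb, hit_rate):
        total = mean_hit_rate(all_ids, estimate_hb, hit_rate)
        within = {name: mean_hit_rate(ids, estimate_hb, hit_rate)
                  for name, ids in segments.items()}
        return total, within   # compare the total-sample hit rate with each within-segment hit rate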
This graph shows the relative importance of the attributes for each of the segments. We have highlighted three of the attributes to demonstrate the differences in the pattern across the segments. These differences meet the usual significance thresholds for both univariate and multivariate tests. 170 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Feature Importance by Segment Segment 1 Segment 2 Price Segment 3 Brand Use Brand Price Use Brand Use Price The next graph shows how much better the fit is when we customize the HB runs to borrow information from only the most relevant segment. Once again, custom-tailoring the HB utilities do not yield better fits. Clustering Segments – HB Utilities Hit Rates 0.6 0.5 0.4 Total Sample Within Segment Total Sample Within Segment Total Sample 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Within Segment 171 We thought that perhaps an alternative segmentation method would yield more expected results. So we ran a Latent Class segmentation using the Sawtooth LClass module to define two segments that comprised 53% and 47% of the sample. These segments are similar to the ones we found using the KMEANS method and they do exhibit a different pattern of attribute importance scores. Feature Importance by Segment Segment 1 Segment 2 Brand Price Use Price Brand Use The results are shown here. Again, the within-segment hit rates were not any better than those from the total sample. L Class Segments Hit Rates 0.6 Segment 2 0.5 Segment 1 0.4 Total Sample 172 Within Segment Total Sample Within Segment 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Undeterred by these unexpected results, we continued using this approach to examine six other datasets. These additional datasets were from business to business studies as well as FMCG studies. The sample sizes ranged from 320 to 800, the number of tasks ranges from 11 to 20, the number of attributes ranged from 4 to 7 with both full profile and partial profile designs. On these datasets, we examined a priori segments as well as latent segments derived using KMEANS and L Class methods. Across the six datasets, we examined latent segment solutions with between two and seven segments. In some instances, there were slight improvements in the within-segment hit rates and in some instances the obverse result obtained. The graph below shows the results across the seven datasets. On the left is the mean hit rate from the 21 sets of HB utilities based on the total samples in our seven datasets. On the right is the mean hit rate from the 222 sets of HB utilities that were customised for the various segments. Even blind Freddy can see that the null hypothesis does not get much nuller than this. This graph illustrates that the effort and expense of custom-tailoring 222 sets of utilities yields a fit that is no better than the 21 “off-the-rack” utilities. The finding across the seven datasets can be stated quite simply: • there is no consistent improvement in predictive accuracy when going to the trouble to compute HB utilities within segments Average Hit Rates Hit Rates 0.60 21 Sets of Utilities 222 Sets of Utilities 0.55 0.50 Total Sample Within Segment So after all of this computation, Lihua and I were faced with a good news – bad news scenario. The good news is that the time and effort associated with customising HB to produce within-segment utilities does not appear to yield anything worthwhile. 
Therefore, we can run only the total sample analyses and then head for the surf with our sense of client commitment fully intact. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 173 The bad news is that our worldview on market segments had been severely challenged. In discussing our findings with colleagues, we encountered a continuum of reaction. This continuum was anchored at one end by responses like “that’s so implausible you must have an error in your calculations”. At the other end of the spectrum, we heard reactions like “just what I expected, segmentation is actually like slicing a watermelon” or “social science data is usually one ‘big smear’ that we cut up in ways that suit our needs”. Returning to our analogy about buying a suit for a moment, suppose we were to attempt to segment the buyers of suits using a few key measurements like length of sleeve, length of inseam, waist size and so forth. In this hypothetical segmentation exercise, we would expect to identify at least two segments of suit buyers. One segment would cluster around a centroid of measurements that is known in the trade as “42 Long”. The members of this segment are more than 6 feet tall and reasonably slim. Another segment likely to emerge from our segmentation is known as “38 Short”. Members of this segment tend to be vertically challenged but horizontally robust. Despite the fact that members of the 42 Long segment look very different from the members of the 38 Short segment, they all buy their suits off the rack from a common distribution of sleeve lengths, inseams and waist measurements. In examining the literature more broadly, we came across other findings that are similar to ours. For example, Allenby, Arora and Ginter (1998) examined three quite different datasets looking for homogenous segments. They did not find convincing evidence of homogeneity of demand: • “For all parameter estimates in the three datasets, the extent of within-component heterogeneity is typically estimated to be larger than the extent of across-component heterogeneity, resulting in distributions of heterogeneity for which well defined and separated modes do not exist. In other words, across the three data sets investigated by us, a discrete approximation did not appear to characterize the market place completely or accurately.” In the aftermath of this project, Lihua and I have come to revise our worldview on market segments by embracing the “watermelon theory”. And as is often the case when one’s fundamentals are challenged, our revisionist view of market segments is a more comfortable one. So while we set out to find nicer fitting HB utilities, we ended up with a better fitting view of market segments. 174 2001 Sawtooth Software Conference Proceedings: Sequim, WA. REFERENCES Allenby, G. M.; Arora, N; and Ginter J. L (1998) “On the heterogeneity of demand.” Journal of Marketing Research 35, 384-389. Allenby, G. M. and Ginter J. L. (1995) “Using Extremes to Design Products and Segment Markets.” Journal of Marketing Research, 32, 392-403. Huber, J., Arora, N. and Johnson, R. (1998) “Capturing Heterogeneity in Consumer Choices.” ART Forum, American Marketing Association. Lenk, P. J., DeSarbo, W. S., Green, P. E. and Young, M.R. (1996) “Hierarchical Bayes Conjoint Analysis: Recovery of Partworth Heterogeneity from Reduced Experimental Designs.” Marketing Science, 15, 173-191. Sentis, K. and Li, L. (2000) “HB Plugging and Chugging: How Much Is Enough.” Sawtooth Software Conference Proceedings, Sawtooth Software, Sequim. 
Smith, W. (1956) “Product Differentiation and Market Segmentation as Alternative Marketing Strategies.” Journal of Marketing, 21, 3-8. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 175 176 2001 Sawtooth Software Conference Proceedings: Sequim, WA. MODELING CONSTANT SUM DEPENDENT VARIABLES WITH MULTINOMIAL LOGIT: A COMPARISON OF FOUR METHODS Keith Chrzan ZS Associates Sharon Alberg Maritz Research INTRODUCTION In many markets, customers split their choices among products. Some contexts in which we often see this sort of choosing are the hospitality and healthcare industries, fast moving consumer goods and business-to-business markets. • • • • In the hospitality industry, for example, business travelers may not always stay in the same hotel, fly the same airline, or rent from the same automobile rental agency. Likewise, most people include two or more restaurants in their lunchtime restaurant mix, eating some of their lunches at one, some at another and so on. Similarly, many physicians recommend different brands of similar drugs to different patients. Outside of pharmaceuticals, buyers and recommenders of medical devices and medical supplies also allocate choices. In fast moving consumer goods categories, consumers may split their purchases among several brands, buying and using multiple brands of toothpaste, breakfast cereal, soda and so on. Finally, many of the choices that are “pick one” for individual consumers are allocations for business-to-business. Consider PC purchases by corporate IT departments. In these cases, and in many others, it may make more sense to ask respondents to describe the allocation of their last 10 or next 10 purchases than to ask which one brand they chose last or which one brand they will choose next. Some controversy attends the modeling of these allocations. They are measured at ratio level, so regression may work. Moreover, because the allocations are counts, Poisson regression comes to mind. On the other hand, the predictors of these counts are not just a single vector of independent variables but a vector of predictors for each alternative in the allocation. This latter consideration suggests a multinomial logit solution to the problem. Although multinomial logit typically has one alternative as chosen and the remainder as not chosen, there are several possible ways of modeling allocation data. Four ways to model constant sum dependent variables with multinomial logit are described and tested below. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 177 Method 1: Winner Takes All (WTA) One could simplify the allocation data and model as if the respondent chooses the one alternative with the highest point allocation and fails to choose the rest. Method 2: Simple Dominance (SD) Another method is to recognize and model preference inequalities. This method is the extension to constant sum data of the method recommended by Louivere, Hensher and Swait (2001) for rating scale data. For example, in a constant sum allocation of his next 10 choices in a given category, Smith allocates 5 points to Brand A, 2 each to Brands B and C, 1 to D and none to E or F. These inequalities are implicit in Smith’s allocation: A>(B, C, D, E, F) B>(D, E, F) C>(D, E, F) D>(E, F) Thus one could turn Smith’s allocation into four choice sets corresponding to the four inequalities above: Set 1: Set 2: Set 3: Set 4: Smith chooses A from the set A-F Smith chooses B from the set B, D-F Smith chooses C from the set C, D-F Smith chooses D from the set D, E, F. 
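A small sketch of how an allocation can be expanded into these dominance sets, using Smith's allocation (an illustration of the data setup; the function and variable names are hypothetical):

    def dominance_sets(allocation):
        # allocation: brand -> points, e.g. Smith's {'A': 5, 'B': 2, 'C': 2, 'D': 1, 'E': 0, 'F': 0}
        # Each brand receiving points is treated as chosen from the set made up of
        # itself plus every brand it strictly dominates (brands with fewer points).
        tasks = []
        for brand, pts in allocation.items():
            if pts == 0:
                continue
            dominated = [b for b, p in allocation.items() if p < pts]
            tasks.append({'chosen': brand, 'choice_set': [brand] + dominated})
        return tasks

    smith = {'A': 5, 'B': 2, 'C': 2, 'D': 1, 'E': 0, 'F': 0}
    for task in dominance_sets(smith):
        print(task)
    # reproduces Sets 1-4 above; ties (B and C) do not generate sets against each other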
Method 3: Discretizing the Allocation (DA) A third way to set up the estimation model involves seeing the 10 point allocation as 10 separate reports of single-brand choosing. Thus there are 10 “choices” to be modeled from the set A-F. In five of these, Smith chooses A over B-F; in two he chooses B over A, C, D, E and F; in two he chooses C over A, B, D, E and F; and in one he chooses D over A, B, C, E and F. Discretizing the allocation is the method for handling constant sum dependent variables in SAS (Kuhfeld 2000). Method 4: Allocation-Weighted Dominance (WD) One could combine Methods 2 and 3 above simply by weighting the four dominance choice sets from Method 2 by the observed frequencies from the allocation: Thus the Set 1 (Smith chooses A from the set A-F) gets a weight of 5 Set 2 (Smith chooses B from the set B, D-F) gets a weight of 2 Set 3 (Smith chooses C from the set C, D-F) gets a weight of 2 Set 4 (Smith chooses D from the set D, E, F) gets a weight of 1 178 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Possible Biases Winner Takes All ignores much of the information provided by respondents about the strength of their preferences; containing the least information, we expect it will do least well in predicting shares. To the extent respondents gravitate toward extreme preferences (100% for a single brand) Winner Takes All should perform better. Dominance Modeling makes use of more of the information but only insofar as the allocation data provides a rank-order preference of the alternatives. Whereas Louviere, Hensher and Swait (2001) use this method to make the most of the preference information contained in rating scale data, in the present case it ignores the magnitude of preference information provided by the constant sum metric. For this reason, Dominance Modeling also may not predict the allocation shares well. Discretizing the Allocation may result in shares that are too flat. The reason is that the same independent variables will predict different dependent variable outcomes and the noise this creates will depress the multinomial logit coefficients. Thus shares of more and less preferred alternatives will be less distinct. Allocation Weighted Dominance makes use of more of the information than Dominance Modeling and Winner Take All. No observations from a single respondent predict different outcomes from an identical pattern of independent variables, so multinomial logit utilities will not be muted. This may produce share estimates that are too spiky. Of the four methods, both Simple Dominance and Weighted Dominance involve substantial programming for data setup. Winner Takes All requires much less programming and Discretized Allocation almost none at all. EMPIRICAL TESTING Using three empirical data sets (two from brand equity studies, one from a conjoint study) we will test whether these four analytical approaches yield the same or different model parameters; if different, we will identify which performs best. It may be that the four models do not differ significantly in their model parameters, or that they differ only in the scale factors, and not in the scale-adjusted model parameters. For this test we employ the Swait and Louviere (1993) test for equality of MNL model parameters. If a significant difference in model parameters results, then it will make sense to discover which model best predicts observed allocation shares. In case of significantly different model parameters, testing will proceed as follows. 
Under each model, each respondent’s predicted allocation will be compared to that brand’s observed allocation for that respondent. The absolute value of the difference between actual and predicted share for each brand will be averaged within respondent. This mean absolute error or prediction (MAE) will be our metric for comparing the four models. We use a comparable metric for aggregate share predictions. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 179 Empirical Study 1 Data Set A hospitality provider commissioned a positioning study in which 247 respondents evaluated the client and three competitors on 40+ attributes. Factor analysis reduced this set of attributes to 13 factors, and factor scores representing these factors are the predictor variables in the model. Respondents’ report of the share of their last 10 shopping occasions is the constant sum dependent variable. Analysis We used the Salford Systems LOGIT package to estimate the MNL models (Steinberg and Colla 1998). Table 1 Raw Model Parameters for Study 1 Parameter 1 2 3 4 5 6 7 8 9 10 11 12 13 Winner Takes All .65 -.43 -.09 .02 .59 -.06 .08 .46 .05 .02 -.05 .15 .35 Simple Dominance .36 -.25 -.03 .14 .37 .00 -.03 .11 .04 .03 .07 .01 .22 Discretized Allocation .50 -.30 -.05 .13 .33 .02 .05 .24 .13 .05 .02 .09 .25 Weighted Dominance .51 -.31 -.05 .11 .38 .02 .06 .23 .13 .06 .03 .07 .25 Table 1 shows the model parameters for the four models. The various manipulations necessary to conduct the Swait and Louviere test reveal that the scale factors for the four models were, setting WTA as the reference: WTA SD DA WD 180 1.000 0.583 0.765 0.785 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Table 2 Scale-Adjusted Parameters for Study 1 Parameter 1 2 3 4 5 6 7 8 9 10 11 12 13 Winner Takes All .65 -.43 -.09 .02 .59 -.06 .08 .46 .05 .02 -.05 .15 .35 Simple Dominance .62 -.42 -.05 .24 .63 .00 -.04 .20 .06 .06 .12 .02 .38 Discretized Allocation .65 -.39 -.07 .18 .43 .03 .06 .31 .17 .07 .02 .12 .33 Weighted Dominance .65 -.39 -.06 .14 .49 .02 .08 .29 .16 .08 .04 .09 .31 Scale-adjusted model parameters appear in Table 2. The coefficients for WTA are larger than for the other models and those for SD are smaller, with WD and DA in between. Unadjusted, WTA will produce the most spiky shares and SD the most flat shares. But we need to test to make sure it makes sense to look at the unadjusted shares. The log likelihoods for the four individual models were: WTA SD DA WD -237.132 -286.741 -264.193 -261.487 The other two log likelihoods needed for the Swait and Louviere test are the log likelihood of the data set that concatenates the above four models (-1058.374) and the scale-adjusted concatenated data set (-1051.804). The omnibus test for the difference in model parameters has a χ2 of 4.502; with 42 degrees of freedom (13 parameters plus one for each model past the first) this is not even close to significant. This means that the models are returning non-significantly different parameters (utilities) after adjusting for differences in scale. The test for the difference in scale, however, is significant (χ2 of 13.14; with three degrees of freedom, p<.005). Results Together the results of these two tests mean that the only significant difference between the parameters from the four models is their scale. One can thus use any of the four models if one first adjusts the parameters by a multiplicative constant to best fit observed (disaggregate) shares. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 
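For readers who wish to reproduce the two test statistics from the log likelihoods reported above, the arithmetic (reconstructed from the reported values) is:

    Sum of the four separate models:  (-237.132) + (-286.741) + (-264.193) + (-261.487) = -1049.553
    Parameter-equality test:  χ2 = 2 × [(-1049.553) - (-1051.804)] = 4.502,  with 42 degrees of freedom
    Scale-equality test:      χ2 = 2 × [(-1051.804) - (-1058.374)] = 13.14,  with 3 degrees of freedom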
181 Empirical Study 2 Data Set 366 health care professionals completed a telephone-based brand equity study. The subject was a medical diagnostic category with five major competitors who accounted for over 95% of category sales. Each respondent rated all of the five brands with which she was familiar on a total of 10 attributes. Respondents also reported what percentage of their purchases went to each of the five brands. Unlike Study 1, this time a large majority of respondents allocated all of their usage to a single brand – nearly 90%. We expect that this might make the models more similar than in Study 1, as all are more likely to resemble Winner Takes All. Analysis Again we used the Salford Systems LOGIT package to estimate the MNL models (Steinberg and Colla 1998). Coefficients were very similar, as were the scale parameters: 1.00, 1.10, 1.02 and 1.14 for Winner Takes All, Simple dominance, Discretized Allocation and Weighted Dominance, respectively. Results For this data set, neither the test for differences in parameters (χ2 of 7.272 with 33 degrees of freedom) nor the test for difference in scale (χ2 of 1.142 with 3 degrees of freedom) is significant. Unlike Study 1, the non-significant differences in coefficients lack even the appearance of being different, because even the models’ scales are not significantly different. Empirical Study 3 Data Set As part of a strategic pricing study, 132 hospital-based purchase influencers completed a phone-mail-phone pricing experiment for a category of disposable medical products. Respondents completed 16 experimental choice sets each from a total design of 32. Each set contained three or four brands at varying prices, so that both brand presence and price are manipulated in the experiment. Also appearing in each choice set is an “other” alternative worded as “any other brand of <widgets> at its usual price.” A nice compromise between Studies 1 and 2, this time about half of respondents’ choices are allocations of 100% to a single alternative in a choice set. Analysis Again we used the Salford Systems LOGIT package to estimate the MNL models (Steinberg and Colla 1998). As in Study 2, neither the coefficients (χ2 of 4.58 with 18 degrees of freedom) nor the scale factors (χ2 of 5.764 with 3 degrees of freedom) differ significantly. 182 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Results As in the previous studies, it does not matter which MNL formulation one uses, as the coefficients are not significantly different. As in Study 2, not even the scale parameters differ significantly. DISCUSSION The four MNL formulations produce coefficients that do not differ significantly from one formulation to another. A utility scale difference occurred in just one of the three studies. It is common to have to calibrate utilities in order to have the best fit between MNL share simulations and actual shares, so it is not clear that any of the formulations is superior to the others. For ease of programming and consistency with software providers’ recommendations, discretized allocation is probably the best way to analyze constant sum data. While we have shown that constant sum dependent variables do lend themselves to analysis via MNL, it is not clear that this is always a good idea. Sometimes we use constant sum dependent variables because we know that there is taste heterogeneity and variety seeking within respondents (say in choice of food products). 
Other times we use constant sum dependent variables when the heterogeneity is influenced by situational factors. A physician may prescribe different antidepressant drugs to different patients because of differences (age, sex, concomitant conditions, concurrently taken drugs) in the patients, not because of any variety seeking on the doctor’s part. Or again, one might eat lunch at McDonald’s one day and Bennigans the next because one day I’m on a short break and one day I have more time, or because one day I am taking the kids to lunch and the next I am going with office mates. When the source of the heterogeneity is situational, it probably makes more sense to model the situational effect directly using a hybrid logit model, one wherein we model choice as a function of attributes of the things chosen (conditional MNL) and as a function of attributes of the chooser or the situation (polytomous MNL). 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 183 REFERENCES Kuhfeld, Warren (2000) “Multinomial Logit, Discrete Choice Modeling: An Introduction to Designing Choice Experiments, and Collecting, Processing and Analyzing Choice Data,” in Marketing Research Methods in the SAS System, SAS Instutute. Louviere, Jordan J, David A. Hensher and Joffre D. Swait (2001), Stated Choice Methods: Analysis and Application, Cambridge: Cambridge University Press. Steinberg, Dan and Phillip Colla (1998) LOGIT: A Suplementary Module by Salford Systems, San Diego: Salford Systems. Swait, Joffre and Jordan Louviere (1993) “The Role of the Scale Parameter in the Estimation and Comparison of Multinomial Logit Models,” Journal of Marketing Research, 30, 315-314. 184 2001 Sawtooth Software Conference Proceedings: Sequim, WA. DEPENDENT CHOICE MODELING OF TV VIEWING BEHAVIOR Maarten Schellekens McKinsey & Company / Intomart BV INTRODUCTION Conjoint choice analysis is ideally suited to model consumer choice-behavior in situations where consumers need to select their preferred choice-alternative out of a number of choicealternatives. In many instances, conjoint choice models are more versatile than traditional conjoint approaches. Complications, like the similarity of choice-alternatives (the IIA-problem) have largely been solved by using nested choice models and segmentation models (latent class analysis, hierarchical bayes). Furthermore, conjoint choice analysis has considerable advantages over traditional conjoint, given its ability to include context effects and alternative-specific attributes, and to directly model choice-behavior without the need for additional assumptions on how to translate utilities into choices. One issue that has not completely been resolved yet, is the simultaneous modeling of a set of interdependent choices in conjoint choice analysis. How do we need to apply conjoint choice modeling in situations in which consumers make a number of interrelated choices? We see this phenomenon in the context of shopping behavior and entertainment. Take for example the process of buying a car: in normal conjoint approaches we model the add-on features, such as a roof, spoilers, cruise-control and automatic gear as attributes of the choice-alternative ‘car’, in order to derive their utility-values. In reality, these ‘attributes’ are not real attributes, but choicealternatives in their own right. However, there is a clear interdependence between the choices for these features and the type of car. A single-choice model is not capable of capturing this complexity. 
APPROACHES FOR MULTI- CHOICE MODELING Most multi-choice situations can be modeled with resource-allocation tasks. These tasks are appropriate when for example the utility-value of the alternatives vary over situations, for instance. Take for example the choice for beer: the choice for a type or brand of beer may well depend on the context in which the beer is consumed. The choice may be different when alone at home compared to a social situation in a bar. It may vary when the weather is hot versus cold. Therefore, when asking for a choice, we’d better ask to indicate how often they would take the different types of beer. Another useful application of resource-allocation tasks is when the ‘law-of-diminishingreturns’ applies. This law is easy to understand in a shopping-context: the utility of the second pair of shoes one buys is much lower compared to the first pair of shoes chosen. This law also applies in the context of variety-seeking: when people visit a fair they rarely visit just one single attraction. Therefore, when we have a budget to spend, we most often allocate it to several choice-alternatives instead of just one. A resource allocation task is most appropriate in these situations. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 185 There are situations in which resource-allocation is not appropriate, however. The example of the choice of a car is not suited well for an allocation task (at least not an allocation task in the traditional sense, as the basic car and the add-ons are complementary products). A better design would be to include add-ons as additional options in choice-tasks and to treat the choice-problem as one in which the choice-alternative at the car-level and at the add-on level influence each other. Another good example is TV-viewing behavior. TV-viewing can be characterized by variety seeking as well: not many viewers want to watch the news all the time and strive towards a more varied portfolio of program genres that reflect their viewing preferences. Again, TV-viewing can not be dealt with appropriately by means of a resource-allocation task. The main complication is that the viewer is restricted in his allocation of viewing-time to the program-genres offered in a specific timeslot, due to the framing in time of the broadcasted programs. Therefore, the choices viewers make given the competitive program-grids as offered by a number of channels, can best be characterized as a number of inter-related, dependent single choices over time. MODELING TV-VIEWING BEHAVIOR In this article I want to clarify the choice-modeling issues that need to be dealt with when modeling such a complex choice-process as TV viewing-behavior. I will use the results of a study that I have undertaken to outline different ways of analyzing viewer behavior and preferences for channels and program-genres, I will also highlight potential pitfalls that may arise in the analysis. RESPONDENT CHOICE TASKS Choice-behavior with multiple dependencies can only be studied well by capturing the complexity of the choice-process in the respondent tasks. Essential is that the choice-tasks resemble well the actual choice-situations viewers face in reality. When studying TV-viewing behavior, this translates into offering respondents the programgrids of a number of channels for a defined time-interval, and asking the respondents which choices they would make.1 A number of simplifications are necessary in order to construct the choice-tasks. 
- First of all, we reduced the diversity of programs to a limited number of program-genres, since we wanted to represent the universe of program-genres, and not so much specific programs. In this study, 15 program genres were defined, to make up the attribute ‘program genres’ with 15 levels. - A number of channels needed to be selected. In this study 8 channels were selected based upon the dominance of the channels in the market. - In actual program grids, not all channels start and finish their programs at the same time. In order to keep the complexity to manageable proportions, timeslots with a length of 4 hours were defined with 4 programs lasting exactly one hour for each channel. This way, it was clear which the competing program-genres were in each hour of the timeslot. 1 This assumes that actual choices are primarily based upon ‘looking through the program-grids’ as they appear in newspapers and TV-guides, and not on ‘choose as you go’ or Zapping. However, this is probably the most feasible way of studying TVviewing behavior. 186 2001 Sawtooth Software Conference Proceedings: Sequim, WA. - In order to limit the information overload for the respondents, the number of channels in each choice-task is set to three. The channels are rotated in a random fashion from task to task. Next to the three channels there is the none-option, to give the respondents the opportunity to express that they don’t like any of the options. This exercise resulted in choice-tasks with the following ‘program-grid’-format: PREFERENCE CARD Time slot: Weekdays 1900-2300 Channel 1 Channel 2 Channel 3 1st program Football Magazine News Don’t watch 2nd program Game show Series Documentary Don’t watch 3r program News Pop music Foreign Film Don’t watch 4th program Local Cinema Variety show Series Don’t watch We asked respondents to indicate for each program-grid, which program/channel combinations they would choose given in the timeslot to which they were allocated. Basically, each choice-task requires four choices: one for each hour in the time-slot. Furthermore, the respondents could indicate that they didn’t want to watch any of the programs. In total, each respondent had to fill out 21 program-grids. In total, 1292 respondents were interviewed, resulting in a database with 108,528 choices available for analysis. The next table shows a hypothetical completed choice-task. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 187 PREFERENCE CARD Time slot: Weekdays 1900-2300 Channel 1 Channel 2 1st program Football Magazine 2nd program Game show 3r program News 4th program Channel 3 News Series Documentary Pop music Foreign film Variety show Local cinema Series Don’t watch Don’t watch Don’t watch Don’t watch ANALYSIS APPROACHES There are two basic approaches to analyze the interdependencies amongst the choices: 1. We consider the choices made at any one hour in a timeslot as a separate choice-task, and treat the availability of choice-alternatives in the other three hours in the timeslot as context-effects in the analysis. Table 2 illustrates this approach. PREFERENCE CARD Time slot: Weekdays 1900-2300 Channel 1 Channel 2 Channel 3 1st program Football Magazine 2nd program Game show Series Documentary Don’t watch 3r program News Pop music Foreign Film Don’t watch 4th program Local Cinema Variety show News Context-effects Don’t watch Context-effects 188 Series Don’t watch 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 
The analysis of this specification is straightforward: in the coding of each choicealternative, these context-effects are being specified and constant for all choice-alternatives in the task. In the analysis the context-effects can be estimated as usual. The exact specification of the context-effects can take several formats. For example, one can specify how often the same program type is being offered in other hours in the timeslot, and study the effect of this ‘availability to see the program at a different hour’ on the propensity to choose this program. Or alternatively, one can study how the availability of a program-genre in one hour diminishes or increases the likelihood of choosing certain program-genres in the next hour. The clear disadvantage of this method is that it is very cumbersome (though not impossible) to explicitly model dependencies in choices, as this would make the coding of the context-effects dependent on choices made by the respondents in the other hours of the timeslot. 2. A more versatile, but also more complex method of analyzing the interdependencies is to consider all four choices made in the grid as one single choice. In other words: the ‘path’ the respondent chooses through the grid is being conceived as one choice-alternative out of all potential paths. Table 3 illustrates this approach: PREFERENCE CARD Time slot: Weekdays 1900-2300 Channel 1 Channel 2 1st program Football Magazine 2nd program Game show 3r program News 4th program Local cinema Series Pop music Variety show Channel 3 News Documentary Foreign film Series Don’t watch Don’t watch Don’t watch Don’t watch One can easily see that the number of choice-alternatives dramatically increases with the number of channels in the grid and/or hours in the timeslot. In the grids used in this study, the number of potential paths is 44 = 256 (the number of channels to the power of the number of hours in the timeslot). However, the flexibility in the analysis increases as well. We can now explicitly model dependent choices, as all choices are captured in the same single choicealternative (the ‘path’). Theoretically, all effects a choice in one hour can have on a choice in a different hour can be modeled. Given the large number of effects that one could look into, one would practically only include the estimation of those effects that can be argued to exist from a substantive point of view. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 189 CONJOINT DESIGN AND RESULTS In order to keep the flexibility in the type of effects to be estimated, the second approach was adopted in this study to analyze the choice-data. Guided by the effects our client was most interested in, the following effects were estimated: - Channel effects - Effects of program-genres for each hour in the timeslot - Selected interaction-effects between channel and program-genre (to answer the question what combinations of program-genre and channel do relatively well or do not so well) - ‘Horizontal’ cross-effects, i.e. the effect of specified program-genres on each other in the same hour (to study the effect that the audience is being taken away to another channel broadcasting the same or similar program-genres) - ‘Vertical’ cross-effects, i.e. 
the effect of specified program-genres on each other at different hours in the timeslot (to study the effect that the availability of specified program-genres at different times in the timeslot may have on the choice) The results for the effects of the program-genres, averaged out over the four hours in a timeslot, are displayed in figure 1. PROGRAM PREFERENCES Program utilities* Across all time slots 1.5 1 0.5 0 ine ts e ow ic sic az or sh us zin Mag mu sp a y m r r t g l la he rie na ma pu Ot Va re itio Po ltu ad u Tr C ll ba ot Fo s rie Se y w n film e film tar ho ate eti al am en ull eb Loc es ign g d b / m m re e u s y a t o g c i w l F G led Do Ne tua ow Ac Kn The utility values are derived from the Multinomial Logit model, and are therefore directly related to the likelihood of choosing one program-genre over another. E.g. if the utility value of one program type is 0.4 higher than the utility of another program-genre, the ‘better’ one is chosen e0.4 = 1.5 times more than the ‘worse’ one. If the difference would be 0.7, it would be chosen twice as often. 190 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Two different types of ‘horizontal’ cross-effects were estimated. First, the effect of the same program-genre at a different channel is -0.40, implying that people are 33% less likely to choose a program-genre, if a similar genre is being broadcasted at the same time on a different channel. The second type of horizontal cross-effect is for program-types that are alike but not exactly the same. Based upon factor-analysis, we defined that the following genres are alike: • actuality – news bulletin • football – other sports • domestic film – series The horizontal cross-effect (averaged out for all three combinations) is -0.28, meaning that people are 25% less likely to watch a program-genre, if a program genre that is alike but not similar is being broadcasted at a different channel during the same hour. More interestingly are the vertical cross-effects, as they can only be derived given the specific choice-design that allows the estimation of these effects. As we assumed that the information processing of the respondents would mainly occur in a sequential fashion, we estimated the vertical cross-effects for each hour separately. This hypothesis was confirmed, as the effects ranged from -0.01 in the first hour of the timeslot to -0.13 in the last hour of the slot, with an average of -0.07 over all hours in the timeslot (the effect of -0.13 means that people were 12% less likely to watch a program-genre if it had already been available in an earlier hour in the slot). The results indicate that the choice for a program genre in the first hour of the slot is hardly affected by the occurrence of this genre in later hours, but that in later hours one does take into account the availability of the genre in earlier hours. POTENTIAL PITFALLS In the analysis of dependent choice modeling, we explicitly conceive the ‘paths’ through the program-grids as the choice-alternatives. As long as we are interested in these paths, this works out fine. We can include ‘dependency-effects’ that explicitly model the interdependencies between program-genres over time.2 A dependency-effect could for example be used to study to what extent specific program-genres are being chosen in the same paths. However, we do need to realize that these effects essentially model the program-preferences of specific segments in the sample. 
POTENTIAL PITFALLS

In the analysis of dependent choice modeling, we explicitly conceive the 'paths' through the program-grids as the choice-alternatives. As long as we are interested in these paths, this works out fine. We can include 'dependency-effects' that explicitly model the interdependencies between program-genres over time.2 A dependency-effect could for example be used to study to what extent specific program-genres are chosen in the same paths. However, we do need to realize that these effects essentially model the program-preferences of specific segments in the sample. For example, if football is chosen in the first hour of the grid, one is also more likely to choose football later on. This is because the respondents in the sample who choose football in the first hour happen to like football more than other respondents, and are therefore more likely to also choose it in the second hour. This is all correct, and the dependency-effects are even necessary to build a model that correctly reflects people's preferences for 'paths'.

2 A dependency-effect is different from a vertical cross-effect: a dependency-effect models the effect of one choice on another choice, whereas a vertical cross-effect only models the effect of a (non-choice related) context on a choice.

However, sometimes we may not be so much interested in these paths as in the question of how often a specific program is chosen on a specified channel, regardless of the paths taken. Especially when working with simulations, in which grids are specified and sensitivity analysis is performed, one may be tempted to aggregate all the paths that contain a specific cell in the grid (program-genre at one of the channels) to calculate the share of audience for that program. This is not without risks, however. I will try to clarify this point.

Let's assume we want to run a simulation based on the program-grid I showed earlier. As you can see, football appears only in the first hour (on channel A). Let's assume we carry out a sensitivity analysis in order to see what happens when Channel 2 changes the series in the second hour to football, resulting in the following program-grid.

PREFERENCE CARD
Time slot: Weekdays 1900-2300

               Channel 1      Channel 2      Channel 3
1st program    Football       Magazine       News            Don't watch
2nd program    Game show      Football       Documentary     Don't watch
3rd program    News           Pop music      Foreign film    Don't watch
4th program    Local cinema   Variety show   Series          Don't watch

What you would expect is that the audience-share for football in the first hour would decrease, as football-lovers now have the option to watch their favorite sport later in the evening as well. However, what actually happens in the model when applying the aggregation procedure described above is that paths with football in both the first and second hour are boosted substantially, as the choice-model now has the opportunity to fit segment-specific preferences (the paths of football-fans). It should be clear that the aggregation procedure results in erroneous output from the simulation model, and should therefore not be used.

The underlying problem, causing this disparity between a specification of a model that in itself is correct and results that are clearly counterintuitive, is the aggregate nature of the choice-model (i.e. the estimated parameters are at the aggregate level, and not at the segment or individual level). If we were able to estimate the dependency-effects at the individual level in such a way that actual paths can be reproduced accurately by the model, this problem would be alleviated. However, in most situations the ratio between the data-points available for each respondent and the number of parameters to estimate for each respondent makes this problem unsolvable.
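For completeness, the aggregation procedure the text cautions against can be written out in a few lines; the path probabilities below are uniform placeholders rather than model output, since the point is only to show the mechanics of summing path probabilities into a cell-level audience share.

from itertools import product

# Hypothetical path probabilities over a 4-hour grid with 4 alternatives per
# hour (indices 0-2 = channels, 3 = "don't watch"); uniform here for brevity.
hours, n_alt = 4, 4
paths = list(product(range(n_alt), repeat=hours))
path_prob = {p: 1.0 / len(paths) for p in paths}   # stand-in for model output

def cell_share(hour, alternative):
    """Aggregate audience share of one grid cell: sum the probabilities of
    every path that passes through (hour, alternative)."""
    return sum(prob for path, prob in path_prob.items()
               if path[hour] == alternative)

# Share of Channel 2 (index 1) in the second hour (index 1).
print(round(cell_share(hour=1, alternative=1), 3))   # 0.25 under uniform paths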
CONCLUSIONS

In this study we tested a new methodology to study dependent choices in TV viewing behavior. The methodology introduces 'dependency-effects' to model the interrelatedness of the choices people make. This methodology works well to the extent that the main focus of the study is to understand the dependencies of the choices at the aggregate level. For TV-viewing behavior, dependency-effects reveal the extent to which specific sequences of choosing programs and channels from program-grids exist. These effects mainly reveal the heterogeneity of viewer-preferences in the market.

The new methodology is not suited for studying the effects of program-changes in program-grids on audience-shares. Therefore, the effects that programs in the grid have on each other, whether broadcast at the same time or at different times, can only be studied by means of cross-effects.

Modeling dependent choices is a neglected topic in choice-modeling. In this study a choice design was developed that allows dependent choice modeling. However, the results show that this approach has limitations that need to be resolved. One avenue for future research on choice-dependency is to develop disaggregated choice-models that include these dependencies. Only then can dependent choice modeling be used to fully complement and enrich the 'single choice' methodology, and only then will we be able to model choice-problems that involve more than a single choice with one encompassing methodology.

ALTERNATIVE SPECIFICATIONS TO ACCOUNT FOR THE "NO-CHOICE" ALTERNATIVE IN CONJOINT CHOICE EXPERIMENTS1

Rinus Haaijer, MuConsult2
Michel Wedel, University of Groningen and Michigan
Wagner Kamakura, Duke University

ABSTRACT

In conjoint choice experiments a "no-choice" option is often added to the choice sets. When this no-choice alternative is not explicitly accounted for in the modeling phase, e.g. by adding a "no-choice constant" to the model, estimates of attributes may be biased, especially when linear attributes are present. Furthermore, we show that there are several methods, some equivalent, to account for the no-choice option.

INTRODUCTION

Choice experiments have become prevalent as a mode of data collection in conjoint analysis in applied research. The availability of new computer-assisted data collection methods, in particular CBC from Sawtooth Software, has also greatly contributed to its popularity. In conjoint choice experiments respondents choose one profile from each of several choice sets. In order to make the choice more realistic, in many conjoint experiments one of the alternatives in the choice sets is a "no-choice" or "none" option. This option can entail a real no-choice alternative ("None of the above") or an "own-choice" alternative ("I keep my own product"). This base alternative, however, presents the problems of how to include it in the design of the choice experiment and in what way to accommodate it in the choice model.

Regular choice alternatives are most often coded in the data matrix with effects-type or dummy coding. Since a no-choice alternative does not possess any of the attributes in the design, one may be tempted to code it simply as a series of zeros. In this paper we investigate several specifications that can be used to accommodate the no-choice option. We show that when the no-choice alternative is not explicitly accounted for in the modeling phase, by adding an additional parameter to the model, estimates of the attribute dummies may be biased.

1 This paper is based on Haaijer, Kamakura, and Wedel (2001).
2 M.E. Haaijer, Junior Projectleader MuConsult BV, PO Box 2054, 3800 CB Amersfoort, The Netherlands. Phone: +31-33-4655054, Fax: +31-33-4614021, Email: R.Haaijer@muconsult.nl. W.A. Kamakura, Professor of Marketing, The Fuqua School of Business, Duke University. P.O. Box 90120, Durham, NC 27708-0120, U.S.A. Phone: (919) 660-7855, Fax: (919) 681-6245, Email: Kamakura@mail.duke.edu.
M. Wedel, Professor of Marketing Research, University of Groningen, and Visiting Professor of Marketing, University of Michigan. P.O. Box 800, 9700 AV Groningen, The Netherlands. Phone: +31-50-3637065, Fax: +31-50-3637207, Email: M.Wedel@eco.rug.nl.

THE BASE ALTERNATIVE

In conjoint choice experiments a base alternative is included in the design of the experiment, among other reasons, to scale the utilities between the various choice sets. A base alternative can be specified in several ways. First, it can be a regular profile that is held constant over all choice sets. Second, it can be specified as "your current brand" and, third, as a "none", "other" or "no-choice" alternative (e.g., Louviere and Woodworth 1983; Batsell and Louviere 1991; Carson et al. 1994). Additional advantages of including a "no-choice" or "own" base alternative that are mentioned in the literature are that it would make the choice decision more realistic and would lead to better predictions of market penetration. A disadvantage of a no-choice alternative is that it may lead respondents to avoid difficult choices, which detracts from the validity of using the no-choice probability to estimate market shares. However, Johnson and Orme (1996) claim that this seems not to happen in conjoint choice experiments. In addition, the no-choice alternative gives limited information about preferences for attributes of the choice alternatives, which is the main reason for doing a conjoint choice experiment.

In this paper we investigate the no-choice option from a modeling point of view3. We start by discussing a number of alternative model formulations. First, simply having a series of zeros describing the attribute values of the no-choice alternative seems a straightforward option, but this formulation may produce misleading results. When there are linear attributes present, the zero values of the no-choice alternative act as real levels of the linear attributes. When, for instance, price is a linear attribute in the design, the zero value for no-choice will correspond to a zero price. We hypothesize that this can lead to a biased estimate of the parameter of the linear attribute when the no-choice option is not accounted for in the model.

Second, when all attributes are modeled with effects-type coding the bias discussed above does not arise, because all part-worths are then specified relative to the zero utility of the no-choice alternative. However, even when all attributes are coded with effects-type dummies, adding such a constant for the no-choice option to the design matrix improves model fit. This can be explained because the no-choice option in fact adds one level to the attributes. Although this additional constant increases the number of parameters by one, it sets the utility level of the no-choice alternative.

Finally, another way to model the presence of a no-choice option is by specifying a Nested Logit model. When two nests are specified, one containing the no-choice and the other the real product profiles, the no-choice alternative is no longer treated as just another alternative. The idea is that respondents first decide whether to choose at all, and only when they decide to choose a real profile do they select one of them, leading to a nested choice decision. This way of modeling the no-choice potentially also removes the effects on the linear attributes, because the zeros of the no-choice are no longer treated as real levels: they are now captured in a different nest.

3 In the remainder of the paper we only mention the "no-choice" (or "none"), but the results also apply to the "own" alternative when nothing about its characteristics is known to the researcher.

EQUIVALENT NO-CHOICE SPECIFICATIONS

There are several equivalent methods to account for the no-choice option, in the sense that they all lead to the same overall (predictive) model fit, as is shown in the application section below. Of course, the estimates for some of the parameters differ across models depending on the specification used. The equivalent specifications that we consider are:

1. Include a "no-choice constant", and model all attributes with effects-type and/or linear coding;
2. Include a "product category constant", and model all attributes with effects-type and/or linear coding. In this situation the no-choice alternative is coded with only zeros;
3. Code one of the attributes with regular dummies (e.g. Brand-dummies), and all other attributes with effects-type and/or linear coding. In this situation the no-choice alternative is also coded with only zeros.

In the application section these specifications will be estimated with the use of the Multinomial Logit model. It is well known that the MNL model may suffer from the IIA property. Several approaches have been developed that do not have this property. Haaijer et al. (1998) used a Multinomial Probit specification with dependencies between and within the choice sets of the respondents (see also Haaijer 1999, and Haaijer, Kamakura and Wedel 2000). Other studies used Mixed Logit specifications (e.g. Brownstone, Bunch and Train 2000) or Bayesian methods to account for IIA violations (e.g. McCulloch and Rossi 1994). However, in this paper we use the simple MNL model, since it suffices to demonstrate our point.
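To illustrate how these specifications differ in the design matrix, the sketch below builds the rows of one choice set under specification 1 (effects-type coding plus a no-choice constant) and under the naive all-zeros coding discussed earlier. The three-level brand attribute, the price levels, and the layout are simplified stand-ins, not the design used in the application below.

import numpy as np

# Simplified example: Brand with 3 levels (effects-coded into 2 columns),
# Price coded linearly, plus a no-choice constant as the last column.
BRANDS = {"A": [1, 0], "B": [0, 1], "C": [-1, -1]}   # effects-type coding

def profile_row(brand, price, no_choice_constant=True):
    """Design row for a real product profile."""
    row = BRANDS[brand] + [price]
    return row + [0] if no_choice_constant else row

def none_row(no_choice_constant=True):
    """Design row for the 'none' alternative."""
    row = [0, 0, 0]            # no brand, and price forced to zero
    return row + [1] if no_choice_constant else row

# Specification 1: columns are (brand effect 1, brand effect 2, price, no-choice constant).
X_spec1 = np.array([profile_row("A", 1), profile_row("B", 3),
                    profile_row("C", 4), none_row()])

# Naive coding: the none row is all zeros, so its 'price' is literally zero,
# which is what can bias the linear price coefficient when left unmodeled.
X_naive = np.array([profile_row("A", 1, False), profile_row("B", 3, False),
                    profile_row("C", 4, False), none_row(False)])

print(X_spec1)
print(X_naive)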
EMPIRICAL INVESTIGATIONS OF MODELING OPTIONS

In this section we provide an application to a commercial conjoint choice data set to illustrate the relative fit of the alternative models and codings of the attributes.

Data Description4

The product we consider is a technological product with six attributes: Brand (6 levels), Speed (4 levels), Technology Type (6 levels), Digitizing Option (no and 2 yes-levels), Facsimile Capable (y/n), and Price (4 levels). The Price and Speed attributes are coded linearly with {1, 2, 3, 4} for the four levels respectively (Speed ascending, Price descending), and the other attributes are coded using effects-type coding. We use 200 respondents, who each had to choose from 20 choice sets with four alternatives, where the last alternative is the "no-choice" option, defined as "none of the above alternatives". We use the first 12 choice sets for estimation and the last 8 for prediction purposes. Each respondent had to choose from individualized choice sets. We compare the results of the models on the Log-Likelihood value, the AIC (Akaike 1973) and BIC (Schwarz 1978) statistics, and the Pseudo R2 value (e.g., McFadden 1976) relative to a null-model in which all probabilities in a choice set are equal to 1/M, with M the number of alternatives in each choice set.
The AIC criterion is defined as: AIC = - 2 ln L + 2n , where n is the total number of estimated parameters in the model and the BIC criterion is defined as: BIC = - 2 ln L + n ln (O) , where O is the number of observations in the conjoint choice 4 We thank Rich Johnson from Sawtooth Software for allowing us to analyze this data set. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 197 experiment. We test differences in the likelihood values for models that are nested with the likelihood ratio (LR) test. Estimating the Equivalent Specifications Table 1 gives the results of the estimation of the three equivalent specifications to account for the no-choice alternative in a choice experiment. All these models were estimated with the standard MNL model. As expected, all three specifications converge to exactly the same optimum and all give the same value of the likelihood for the predictions of the holdout choice sets. When the models have a parameter in common, the estimates are equal. The model with the no-choice constant and the model with the product-category constant differ only in the estimate for this constant, which are equal in value but opposite in sign. The model that contains the brand-dummies, instead of effect-type coding for the brand-attribute, only shows different estimates for these brand-dummies. Note that the utility for brand A in both other specifications is equal to minus the sum of the estimates for the brand B up to brand F parameters, which is the standard way to calculate the utility of the reference level of an attribute that is coded with effects-type coding. Note also that in Table 1 the estimate for the no-choice constant is relatively large and positive. This means that the no-choice has a high overall utility, which is also shown by the high number of times the no-choice alternative was actually chosen (in 43.1% of all choice sets). Similar, the large negative value for the product-category constant in the second model and the large negative values for all brand-dummies in the third model also show a low preference for the product-category. 198 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Table 1: Estimation results MNL specifications No Choice constant Estimate Parameters β00 Brand A β01 Brand B β02 Brand C β03 Brand D β04 Brand E β05 Brand F β06 Speed β07 Tech. Type A β08 Tech. Type B β09 Tech. Type C β10 Tech. Type D β11 Tech. Type E β12 Dig. Opt (n) β13 Dig. Opt (y1) β14 Facsimile β15 Price cnc No-Choice constant cpc Product category const. 0.472 -0.011 0.037 -0.187 0.020 0.129 -0.633 0.575 -0.368 0.628 -0.173 -0.732 0.172 -0.543 0.396 2.461 - Prod. Cat. constant s.e. Estimate * .066 .070 .071 * .075 .071 * .028 * .086 * .064 * .078 * .064 * .075 * .052 * .043 * .032 * .028 * .121 0.472 -0.011 0.037 -0.187 0.020 0.129 -0.633 0.575 -0.368 0.628 -0.173 -0.732 0.172 -0.543 0.396 -2.461 Brand dummies s.e. Estimate * .066 .070 .071 * .075 .071 * .028 * .086 * .064 * .078 * .064 * .075 * .052 * .043 * .032 * .028 * .121 s.e. 
-1.990 -2.473 -2.424 -2.648 -2.441 -2.793 0.129 -0.633 0.576 -0.368 0.628 -0.173 -0.732 0.172 -0.543 0.396 * .131 .140 * .142 * .144 * .141 * .144 * .028 * .086 * .064 * .078 * .064 * .075 * .052 * .043 * .032 * .028 * - Fit Statistics Ln-Likelihood AIC BIC Pseudo R2 Predict Statistics Ln-Likelihood AIC BIC Pseudo R2 -2663.017 5358.035 5450.566 0.200 -2663.017 5358.035 5450.566 0.200 -2663.017 5358.035 5450.566 0.200 -1706.087 3444.175 3530.219 0.231 -1706.087 3444.175 3530.219 0.231 -1706.087 3444.175 3530.219 0.231 *: p<0.05 Estimating Different Model Options In this section we compare the estimation results of the MNL model with the no-choice constant (we call this from here the No-Choice Logit model) with an MNL specification that does not contain this constant to show that not accounting for the no-choice option may give very misleading results. Furthermore, both are compared with the Nested MNL model that presents a different choice situation. The difference between the MNL model and the No-choice MNL model is the extra constant (cnc) added for the no-choice option in the design, but both models fall within the standard Multinomial Logit context for conjoint experiments. In the Nested Logit model there is one extra parameter (λ) called the dissimilarity coefficient (Börsch-Supan 1990). When its value is equal to 1, the Logit and Nested Logit model are equal. For all models we use two versions, in the first situation the linear levels are coded as: {1, 2, 3, 4} respectively, and in the second situation as {-3, -1, 1, 3}, to investigate whether mean 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 199 centering the linear levels solves (part of) the problem in any of the three models considered. Table 2 lists the estimation result for all models. Note that the Nested Logit and the No-choice Logit models are not nested, but both are nested within the Logit model without the constant. In the Nested Logit model we do not estimate λ itself but estimate (1-λ) instead, to have a direct test on λ=1. The first conclusion that can be drawn from Table 2 is that the No-choice Logit model gives the best overall fit, in both situations, and converged to the same point. The Likelihood is significantly better than the standard MNL model (LR(1 df) tests, p<0.01). The No-choice Logit model and the Nested Logit model are not nested, so these models cannot be compared with an LR test. The AIC and BIC values show, however, that the No-choice Logit model fits better than the Nested Logit model, which itself is significantly better than the Logit model (LR(1) tests, p<0.01) again in both situations. Table 2 also shows that the estimates for the dissimilarity coefficients (λ) are significantly different from 1 for the Nested Logit model, hence the Nested Logit differs significantly from the MNL model. Table 2: Model: Levels linear attributes: {1, 2, 3, 4} No-Choice MNL Nested MNL MNL Est. s.e. Est. s.e. Est. s.e. Parameters β01 Brand B β02 Brand C β03 Brand D β04 Brand E β05 Brand F β06 Speed β07 Tech. Type A β08 Tech. Type B β09 Tech. Type C β10 Tech. Type D β11 Tech. Type E β12 Dig. Opt (n) β13 Dig. 
Opt (y1) β14 Facsimile β15 Price 0.386 -0.013 0.052 -0.150 0.011 -0.237 -0.531 0.505 -0.321 0.514 -0.132 -0.586 0.128 -0.445 -0.013 1-λ Nested Logit cnc No-Choice - Fit Statistics Ln-Likelihood AIC BIC Pseudo R2 Predict Statistics Ln-Likelihood AIC BIC Pseudo R2 200 Estimation and prediction results * .061 .067 .067 * .071 .067 * .021 * .081 * .060 * .075 * .060 .071 * .049 * .041 * 0.30 .020 0.522 -0.009 -0.006 -0.166 -0.018 0.118 -0.678 0.613 -0.366 0.635 -0.149 -0.714 0.180 -0.528 0.385 * 0.472 -0.011 0.037 -0.187 0.020 0.129 -0.633 0.575 -0.368 0.628 -0.173 -0.732 0.172 -0.543 0.396 0.924 - * 2.461 .077 .078 .080 * .087 .081 * .031 * .096 * .074 * .086 * .074 .084 * .055 * .046 * .035 * .031 .009 * .066 .070 .071 * .075 .071 * .028 * .086 * .064 * .078 * .064 * .075 * .052 * .043 * .032 * .028 * .121 Levels linear attributes: {-3, -1, 1, 3} No-Choice MNL Nested MNL MNL Est. s.e. Est. s.e. Est. s.e. 0.350 -0.012 0.024 -0.142 0.013 0.046 -0.451 0.432 -0.281 0.471 -0.132 -0.488 0.099 -0.386 0.144 * .064 .069 .071 * .074 * .070 * .014 * .084 * .062 * .077 * .062 .073 * .050 * .043 * .031 * .014 - .0524 -0.010 0.001 -0.176 -0.011 0.067 -0.682 0.621 -0.377 0.651 -0.159 -0.721 0.178 -0.539 0.203 * 0.472 -0.011 0.037 -0.187 0.020 0.064 -0.634 0.575 -0.368 0.628 -0.173 -0.732 0.172 -0.543 0.198 0.840 - * 1.150 .076 .078 .080 * .086 .080 * .016 * .095 * .073 * .086 * .073 .083 * .055 * .046 * .035 * .015 .017 * .066 .070 .071 * .075 .071 * .014 * .086 * .064 * .078 * .064 * .075 * .052 * .043 * .032 * .014 * .048 -2906.738 5843.476 5930.224 0.126 -2715.413 5462.826 5555.358 0.184 -2663.017 5358.035 5450.566 0.200 -2947.716 5925.431 6012.180 0.114 -2701.927 5435.854 5528.386 0.184 -2663.017 5358.035 5450.567 0.200 -1883.069 3796.139 3876.805 0.151 -1741.723 3515.448 3601.492 0.215 -1706.087 3444.175 3530.219 0.231 *: p<0.05 -1960.978 3951.956 4032.622 0.116 -1735.014 3502.028 3588.072 0.218 -1706.087 3444.175 3530.219 0.231 2001 Sawtooth Software Conference Proceedings: Sequim, WA. When the β-estimates are compared, Table 2 shows that the parameter estimates of the attributes with a dummy-coding (β01, …, β05, β07, …, β14) are somewhat different, although not dramatically so. However, in the left-hand side of Table 2, the coefficients of the linear attributes (β06, β15) in the standard MNL model differ strongly from the other two models. Whereas the estimate for speed is negative (a high level is unattractive) and significant for the MNL model, it is positive (a high level is attractive) and significant for the other models. The price parameter shows a similar effect; it is negative but not significant in one situation and positive in the other for the MNL model and positive (lower price is more attractive) and significant in the other two models. Clearly, both estimates for the linear attributes show a strong negative bias. Note, however, that there are also differences in the other part-worth estimates across the models. The right-hand side of Table 2 shows that when the Speed and Price variables are coded with values such that the mean of the levels is zero, the estimates for Speed and Price do not longer show the wrong sign, but are still biased downwards compared to the other models. In the Nochoice MNL model the estimates for all parameters are equal in both situations, except for the linear parameters which have in the right-hand side of Table 2 exactly half the value of those in the left-hand side of Table 2, which is the result of the doubled step-length of the linear levels. 
When the attributes Price and Speed are also coded with effects-type coding (not shown), the β-estimates are more similar across the three models, all having the same signs; however, the MNL model that explicitly accounts for the no-choice option still outperforms the MNL model without the constant and the NMNL model. The conclusion that can be drawn from the above analysis is that the presence of a no-choice alternative and linearly coded attributes can give very misleading results, in particular for the parameters of those linear attributes, when the conjoint choice data are estimated with a standard Logit model without accounting for the no-choice option. However, the parameters of attributes coded with effects-type dummies are also affected, albeit less severely. When all attributes are coded with effects-type coding the bias seems less strong, but the coefficient estimates are still strongly attenuated. Overall fit can be improved substantially by specifying a Nested Logit or by adding a no-choice constant to the design. When we compare the Nested Logit and the No-choice Logit results we see that both compensate for the no-choice zero level for the linear attributes, but there are some differences in the magnitudes of the estimated coefficients, some in the range of 5-10%, which may be important in substantive interpretation. However, the fit of the No-choice Logit model is much better than that of the Nested Logit model in our application.

The estimates in Table 2 were used to predict the 8 holdout choice sets. Table 2 gives the values of the statistics for the predictive fit of the three models for the three different design options considered. The No-choice Logit model gives the best predictions, which are significantly better than the Logit model (LR(1) tests, p<0.01) and which are also better than the Nested Logit model in all situations. The Nested Logit model also predicts significantly better than the standard Logit model (LR(1) tests, p<0.01). Thus, the predictive validity results confirm the results on model fit. Note that although the MNL model with linear levels {1, 2, 3, 4} is clearly misspecified (as could be seen from the Speed and Price estimates), its likelihood, both in estimation and prediction, is better than those of the MNL models with both other ways of coding. However, in all situations the Nested Logit and No-choice Logit models show superior fit.

CONCLUSIONS AND DISCUSSION

Respondents may choose the no-choice alternative for two reasons. First, they may not be interested at all in the product category under research and for this reason choose the no-choice. In such a situation they would first decide whether or not to choose one of the offered product profiles, and the Nested MNL model may be the most appropriate specification to describe this behavior, since it puts the no-choice alternative in a different nest than these product profiles. The probability of the no-choice alternative may be an indication of the overall preference for the product in this case, and the model may be used to obtain an estimate of the overall attractiveness of the product category. Thus, in this case the no-choice alternative would capture "real" behavior of consumers in the marketplace. Second, respondents may choose the no-choice because no real alternative in the choice set is attractive enough, or because all alternatives are roughly equally attractive and they do not want to spend more time on making the difficult choice.
In this case the respondent treats the no-choice as “just another” alternative, and the nochoice captures an effect specific to the task. If this is the case, the MNL model, with a no-choice constant, is the appropriate model to use, since it treats all alternatives equal. Now the utility of the no-choice option does not have a substantive meaning but serves as an indicator of respondents’ involvement with the task. In our application we saw that the No-choice MNL model produced better results compared to the Nested MNL model. This may be an indication that the second explanation for choosing the no-choice may have been appropriate. In other words, when a conjoint choice experiment with a no-choice alternative is estimated with both the No-choice MNL model and the Nested MNL model, the (predictive) fit of the models may give an indication of the substantive reasons respondents have to choose the no-choice option. In particular, when the No-choice MNL model provides the best fit it may be inappropriate to interpret the estimates as reflecting the overall attractiveness of the category. We would like to note that in our application the effect of the nochoice option itself was significant. If that is not the case, the model converges to the standard MNL and may neither fit nor predict better. However, the inclusion of the no-choice constant in the model allows for this test, and at the same time gives an indication of whether the standard MNL would be more appropriate. We also showed that there are at least three equivalent ways to account for the no-choice option in the MNL model. One could add a no-choice constant or a product category constant, or one could code one of the attributes (e.g. Brand) with regular dummies and code the (remaining) attributes in the design with effects-type and/or linear coding. Not accounting for the no-choice option may lead to biased estimates of the attribute-levels, especially the estimates of the linear attributes may be highly affected. 202 2001 Sawtooth Software Conference Proceedings: Sequim, WA. REFERENCES Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle”, In: B.N. Petrov and F. Csáki (eds.) “2nd International Symposium on Information Theory”, Akadémiai Kiadó, Budapest, 267-281. Batsell, R.R. and J.J. Louviere (1991), “Experimental Analysis of Choice”, Marketing Letters, 2(3), 199-214. Börsch-Supan, A. (1990), “On the Compatibility of Nested Logit Models with Utility Maximization”, Journal of Econometrics, 46, 373-388. Brownstone, D., D.S. Bunch, and K. Train (2000), “Joint Mixed Logit Models of Stated and Revealed Preferences for Alternative-fuel Vehicles,” Transportation Research B, 34(5), 315338. Carson, R.T., J.J. Louviere, D.A. Anderson, P. Arabie, D.S. Bunch, D.A. Hensher, R.M. Johnson, W.F. Kuhfeld, D. Steinberg, J. Swait, H. Timmermans, and J.B. Wiley (1994), “Experimental Analysis of Choice”, Marketing Letters, 5(4), 351-368. Haaijer, M.E. (1999), “Modeling Conjoint Choice Experiments with the Probit Model”, Thesis, University of Groningen. Haaijer, M.E., W.A. Kamakura, and M. Wedel (2000), “The Information Content of Response Latencies in Conjoint Choice Experiments”, Journal of Marketing Research, 37(3), 376-382. Haaijer, M.E., W.A. Kamakura, and M. Wedel (2001), “The ‘No-choice’ Alternative in Conjoint Choice Experiments”, International Journal of Market Research, 43(1), 93-106. Haaijer, M.E., M. Wedel, M. Vriens, and T.J. 
Wansbeek (1998), “Utility Covariances and Context Effects in Conjoint MNP Models”, Marketing Science, 17(3), 236-252. Louviere, J.J. (1988), “Conjoint Analysis Modelling of Stated Preferences. A Review of Theory, Methods, Recent Developments and External Validity”, Journal of Transport Economics and Policy, January, 93-119. Louviere, J.J. and G. Woodworth (1983), “Design and Analysis of Simulated Consumer Choice or Allocation Experiments: An Approach Based on Aggregate Data”, Journal of Marketing Research, 20(4), 350-367. McCulloch, R.E. and P.E. Rossi (1994), “An Exact Likelihood Analysis of the Multinomial Probit Model”, Journal of Econometrics, 64, 207-240. McFadden, D. (1976), “Quantal Choice Analysis: A Survey”, Annals of Economic and Social Measurement, 5(4), 363-390. Schwarz, G. (1978), “Estimating the Dimension of a Model”, Annals of Statistics, 6, 461-464. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 203 204 2001 Sawtooth Software Conference Proceedings: Sequim, WA. HISTORY OF ACA Richard M. Johnson Sawtooth Software, Inc. INTRODUCTION Although ACA makes use of ideas that originated much earlier, the direct thread of its history began in 1969. Like much work of the development in marketing research, it began in response to a client problem that couldn’t be handled with current methodology. THE PROBLEM In the late ‘60s I was employed by Market Facts, Inc., and the client was in a durable goods business. In his company it was standard practice that whenever a new or modified product was seriously contemplated, a concept test had to be done. The client was responsible for carrying out concept tests, and he answered to a product manager who commissioned those tests. Our client’s experience was like this: The product manager would come to him and say: “We’re going to put two handles on it, it’s going to produce 20 units per minute, it will weigh 30 pounds, and be green.” Our client would arrange to do a test of that concept, and a few weeks later come back with the results. But before he could report them, the product manager would say: “Sorry we didn’t have time to tell you about this, but instead of two handles it’s going to have one and instead of 20 units per minute it will produce 22. Can you test that one in the next three weeks?” And so on. Our client found that there was never time to do the required concept tests fast enough to affect the product design cycle. So he came to us with what he considered to be an urgent problem – the need to find a way to test all future product modifications at once. He wanted to able to tell the product manager, “Oh, you say it’s going to have one handle, with 22 units per minute, weigh 30 pounds and be green? Well, the answer to that is 17 share points. Any other questions?” Of course, today this is instantly recognizable as a conjoint analysis problem. But Green and Rao had not yet published their historic 1971 article, “Conjoint Measurement for Quantifying Judgmental Data” in JMR. Also, the actual problem was more difficult than indicated by the anecdote above, since the client actually had 28 product features rather than just four, with some having as many as 5 possible realizations. Tradeoff Matrices It seemed that one answer might lie in thinking about a product as being a collection of separate attributes, each with a specified level. 
This presented two immediate problems: a new method of questioning was needed to elicit information about values of attribute levels, and a new estimation procedure was needed for converting that information into “utilities.” 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 205 Our solution came to be known as “Tradeoff Analysis.” Although I wasn’t yet aware of Luce and Tukey’s work on Conjoint Measurement, that’s what Tradeoff Analysis was. To collect data, we presented respondents with a number of empty tables, each crossing the levels of two attributes, and asked respondents to rank the cells in each table in terms of their preference. We realized that not every pair of attributes could be compared, because that might lead to an enormous number of matrices to be ranked. After much consideration, we decided to pair each attribute with three others, which resulted in 42 matrices for the first study. One has to experience filling out a 5x5 tradeoff matrix before he can really understand what the respondent goes through. If the respondent must fill out 42 of them, one can only hope he remains at least partially conscious through the task. To estimate what we now call part-worths, we came up with a non-metric regression procedure which found a set of values for each respondent which, when used to border the rows and columns of his matrices, produced element-wise sums with rank orders similar to respondents’ answers. Although we learned a lot about how to improve our technique for future applications, this first study, conducted in 1970, was a success. The client was enthusiastic about his improved ability to respond to his product manager’s requests. The client company commissioned many additional tradeoff studies, and similar approaches were used in hundreds of other projects during the next several years. In those early days there was less communication between practitioners and academics than we enjoy today. My early work at Market Facts was done almost in a vacuum, without the knowledge that a larger stream of similar development was taking place simultaneously among Paul Green and his colleagues. ACA benefited greatly from interactions with Paul in later years, and as time passed it became clear that Tradeoff Analysis was just a different variety of Conjoint Analysis. As such, it made all of the assumptions common to Conjoint Analysis, plus one more big one. Assumptions and Difficulties Like other conjoint methods, we assumed that the utility of a product was the sum of values attaching to its separate attribute levels. However, Tradeoff Analysis, like all more recent “partial profile” methods, further assumed that respondents’ values for attribute levels did not depend on which other attributes were present in a concept description. In other words, Tradeoff Analysis required a strong “all else equal” assumption regarding the attributes omitted from each matrix. This made Tradeoff Analysis uniquely vulnerable to distortion if attributes were not considered to be independent by respondents. Suppose two attributes are different in the mind of the researcher, but similar in the mind of the client, such as, say, Durability and Reliability. When trading off Durability with price, the respondent may fear he is giving up Reliability when considering a lower level of Durability. This kind of double-counting can lead to distorted measures of attribute importance. 206 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 
As another example, price is often regarded as an indicator of quality. As long as partial profile concept presentations include both price and quality, one would not expect to see reversals in which higher prices are preferred to lower ones. However, if concept presentations include only price but not quality, price may be mistaken as an indicator of quality and respondents may act as though they prefer higher prices. Similar problems still characterize all partial profile methods today, and it remains critically important when using partial profile methods to remind respondents that the concepts compared are to be considered identical on all omitted attributes. A second problem unique to Tradeoff Analysis was the difficulty respondents had in carrying out the ranking task. Though simple to describe, actual execution of the ranking task was beyond the capability of many respondents. We observed that many respondents simplified the task by what we called “patterned responses,” which consisted of ranking the rows within the columns, or the columns within the rows, thus avoiding the more subtle within-attribute tradeoffs we were seeking. This difficulty appeared to be so severe that it motivated the next step in the evolution which resulted in ACA. Computer-Assisted Interviewing Researchers who began their careers in the ‘70s or later will never be able to appreciate the dramatic improvement of computer technology that occurred during the ‘50s and ‘60s. In the late ‘50s “high speed” computers were available, but only at high cost and in a limited way. While at Procter and Gamble in the early ‘60s I considered myself lucky to have access to a computer at all, but I would get one or at best two chances in a 24 hour period to submit a programming project for debugging. A single keypunch error would often render an attempt useless. It’s amazing that we were able to get any work done at all under those conditions. However, in the ‘70s time sharing became common, providing an enormous improvement in access to computers. In marketing research, we depend heavily on data from survey respondents. When something is wrong in a set of results, it can often be traced to a problem at the “moment of impact,” when the respondent provided the data. Originally having been trained as a psychologist, I was interested in the dynamics of what happens in interviews. When time sharing and CRT terminals first became available, I became excited about the possibility of using them to enhance the quality of market research interviews. I still remember an experience at Market Facts when I arranged a meeting of the company’s management to demonstrate the radical idea of computer-assisted interviewing. I had borrowed the most cutting-edge CRT terminal of the time, which consisted of a tiny 3-inch screen in an enormous cabinet. I had shrouded the CRT with a cloth so I could introduce the idea of computer-assisted interviewing without distraction. The meeting went well until the unveiling, when, with a flourish, I removed the cloth to reveal the CRT. When they saw the tiny screen in the enormous cabinet, everyone in the room began to laugh. And they continued laughing until I ended the meeting. Fortunately, CRT terminals also improved rapidly, and it wasn’t long before computer-assisted interviewing became entirely feasible. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 
207 Pairwise Tradeoff Analysis Ranking cells in a matrix can be difficult for respondents, but answering simple pairwise tradeoff questions is much easier. For example, we could ask whether a respondent would prefer a $1,000 laptop weighing 7 pounds or a $2,000 laptop weighing 3 pounds. Consider two attributes like Price and Weight, each with three levels. In a 3x3 tradeoff matrix there are 9 possible combinations of levels, or cells. We could conceivably ask as many as 36 different pairwise preference questions about those 9 cells, taken two at a time. However, if we can assume we know the order of preference for levels within each attribute, as we probably can for price and weight, we can avoid asking many of those questions. Suppose we arrange the levels of each attribute in decreasing order of attractiveness, so that cells above and to the left should be preferred to those below or to the right. Then we can avoid questions comparing two classes of cells. First, we can avoid questions comparing any two cells that are similar on one attribute, such as comparisons of cells in the same row or in the same column. This avoids 18 possible questions. Of the possible questions that remain, we can avoid those that compare any cell with another that is dominated on both attributes, such as below it and to its right. That eliminates another 9 questions, leaving a total of only 9 for which we cannot assume the answer. Among those, if we are lucky (and if respondents answer without error) we may have to ask only two questions to infer the answers of the remaining seven. For example, in the 3x3 matrix with lettered cells, a d g b e h c f i if we were to learn that c is preferred to d and f is preferred to g, then we could infer that rows dominate columns in importance, and we could infer the rank order of all 9 cells. Likewise, learning that g is preferred to b and h is preferred to c would permit inference that column differences are all more important than any row differences, and we could also infer the entire rank order. By the mid ‘70s computer technology had advanced sufficiently that it became feasible to do computer-assisted Tradeoff Analysis using pairwise questioning. A large project was undertaken for a U.S. military service branch to study various recruiting incentives. The respondents were young males who had graduated from high school but not college. A large number of possible incentives were to be studied, and we were concerned that the required number of tradeoff matrices would strain the capabilities of our respondents. My associate at Market Facts, Frank Goode, studied strategies for asking pairwise questions that would be maximally informative, and wrote a question-selecting program that could be used to administer a pairwise tradeoff interview. We purchased what was then described as a “minicomputer,” which meant that it filled only a small room rather than a large one. 208 2001 Sawtooth Software Conference Proceedings: Sequim, WA. Respondents sat at CRT terminals at interviewing sites around the U.S., connected to a central computer by phone lines. Each respondent was first asked for within-attribute preferences, permitting all attributes subsequently to be regarded as ordered, and then he was asked a series of intelligently chosen pairwise tradeoff questions. We found that questioning format to be dramatically easier for respondents than filling out tradeoff matrices. The data turned out to be of high quality and the study was judged a complete success. 
That study marked the beginning of the end for the tradeoff matrix. Microcomputer Interviewing In the late ‘70s Curt Jones and I founded the John Morton Company, a partnership with the goal of applying emerging analytic techniques in a strategic marketing consulting practice. We were still utterly dependent on the quality of the data provided by respondent interviews, and that led to many problems; but I remained convinced that computer-assisted interviewing held at least part of the answer. By that time the first microcomputers were becoming available, and it seemed that computerassisted interviewing might finally become cost-effective. We purchased an Apple II and I began trying to produce software for a practical and effective computer-assisted tradeoff interview. Use of microcomputers meant not having to be connected by phone lines and not having to wait for one’s turn in time sharing, and also provided powerful computational resources. My initial approach differed from the previous one in several ways: First, it made more sense to choose questions that would reduce uncertainty in the partworths being estimated, rather than choosing questions to predict how respondents might fill out tradeoff matrices. This was a truly liberating realization, which greatly simplified the whole approach. Second, it made sense to update the estimates of part-worths after each answer. Each update took a second or two, but respondents appeared to appreciate the way the computer homed in on their values. One respondent memorably likened the interview to a chess game where he made a move, the computer made a move, etc. Third, a “front-end” section was added to the interview, during which respondents chose subsets of attributes that were most salient to them personally, as well as indicating the relative importance of each attribute. The questioning sequence borrowed some ideas from “Simalto,” a technique developed by John Greene at Rank Xerox. We used this information to reduce the number of attribute levels to be taken into the paired-comparison section of the interview, as well as to generate an initial set of self-explicated part-worths which could be used to start the pairedcomparison section of the interview. Finally, those paired-comparison questions were asked using a graded scale, from “strongly prefer left” to “strongly prefer right.” Initially we had used only binary answers, but found additional information could be captured by the intensity scale. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 209 Small computers were still rare, so the experience of being interviewed had considerable entertainment value. We found that an effective way to sell research projects was to preprogram a conjoint interview for a prospective client’s product category and take an Apple with us on the sales call. Once a marketing executive had taken the interview and had seen his own part-worths as revealed by the computer, he often couldn’t wait to use the same technology in a project. We purchased several dozen Apple computers, and began a fascinating adventure of using them all over the world, in many languages and in product categories of almost every description. Those early Apples were much less reliable than current-day computers. I could talk for hours about difficulties we encountered, but the Apples worked well enough to provide a substantial advance in the quality of data we collected. 
ACA In 1982 I retired as a marketing research practitioner, moved to Sun Valley, Idaho, and soon started Sawtooth Software, Inc. I had been fascinated by the application of small computers in the collection and analysis of marketing research data, and was now able to concentrate on that activity. IBM had introduced their first PC in the early ‘80s, and it seemed clear that the “IBMcompatible” standard would become dominant, so we moved from the Apple II platform to the IBM DOS operating system. With that move we achieved 80 characters per line rather than 40, color rather than monochrome, and a large improvement in hardware reliability. ACA was one of Sawtooth Software’s first products. The first version of ACA offered comparatively few options. Our main thought in designing it was to maximize the likelihood of useful results, which meant minimizing the number of ways users could go wrong. I think we were generally successful in that. ACA had the benefit of being developed over a period of several years, during which its predecessors were refined in dozens of actual commercial projects. Although there were some “ad hoc” aspects of the software, I think it is fair to say that “it worked.” During the last 20 years I’ve had many helpful interactions with Paul Green and his colleagues. One of the most useful was a JMR article by Green, Kreiger, and Agarwall with suggestions about how to combine data from the self-explicated and paired comparison sections. Those suggestions led to a major revision of the product which provided additional user options. ACA has also benefited from helpful contributions of other friendly academics, especially Greg Allenby and Peter Lenk. The ACA/HB module uses Bayesian methods to produce estimates of individual part-worths that are considerably better than the usual estimates provided by ACA. In particular, HB provides a superior way to integrate information from the two parts of the interview. That consists of doing standard Bayesian regression where the paired comparison answers are the only data, and where the self-explicated data are used only as constraints. 210 2001 Sawtooth Software Conference Proceedings: Sequim, WA. I believe ACA users who are content with the ordinary utilities provided by ACA are too satisfied with their results. The results from using the HB module are enough better than those of standard ACA that I think the HB module should almost always be used. I have been involved in one way or another with ACA for more than 30 years. During that time it has evolved from an interesting innovation to a popular tool used world-wide, and has been accepted by many organizations as a “gold standard.” As I enter retirement, others are carrying on the tradition, and I believe you will see continuing developments to ACA that will further improve its usefulness. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 211 212 2001 Sawtooth Software Conference Proceedings: Sequim, WA. A HISTORY OF CHOICE-BASED CONJOINT Joel Huber Duke University INTRODUCTION I will present today a brief and highly selective history of choice-based conjoint. In the shadow of largely negative critical but positive popular response to Jurassic Park III, I propose that it is useful to think about ACA as the most efficient predator, the T-Rex of the rating-based conjoint systems. Just as ACA dominated most of the world of conjoint, I predict that choice based conjoint may evolve to eventually dominate over ratings based systems. 
The purpose of this talk it to indicate how choice experiments, originally seeming so like tiny mammals scurrying under the heels of mighty dinosaurs, are likely to dominate the marketing research landscape. WHY CHOICES? The first question to ask is why should we base preference measurement on choices instead of ratings, even ACA’s extremely clever battery of questions? There are three primary reasons: 1. Choice reflects what people do in the marketplace. In contrast to ratings, which people rarely do unless asked by market researchers, people make choices every day. These choices make the difference between success and failure for a product or a company. Choices can be designed to replicate choices in the marketplace, but more important to assess what people would choose if options were available. 2. Managers can immediately use the implications of a choice model. Choice models can be fed into a simulator to estimate the impact of a change in price on the expected share of an item. With choices it is not necessary to make the assertion that people’s ratings will match their choices, we only need to assert that their stated choices will match actual choices. Matching choices may require a leap, but the leap is much more justifiable and less risky than the hurdle between ratings and choices. 3. People are willing to make choices about almost anything. It is surprising how people are willing to make choices, but less willing to offer general judgments supporting those choices. For example, suppose you were to ask a person how much more they would pay per year to buy electricity from a company that won an award from Friends of the Earth, over one that costs you $100 but is on the worst polluter list. Most respondents would consider that a hard question and might be reluctant to answer. However, they have little problem with a choice between an award winning utility that cost $125 per month versus one on the worst polluter list that costs $100. Just as people are facile at making choices given partial data in the market, so they have little problem responding to hypothetical choices in a survey. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 213 In short, there are good reasons for basing business decisions of the responses of people to hypothetical choices. The reason they were not used initially is that choices were so very hard to model. The story of the evolution of the technology that could tame choices to be able to appropriately model market behavior encompasses the rest of this talk. WHITHER CHOICES? Where did choice experiments come from? Choice experiments in their simplest form are as old as commerce. All they require is the monitoring of sales as a function of different prices or features. The problem for choices is just the same as the one that motivated the development of ACA. How can we set up choice experiments to make predictions not just for one shift in an offering, but for a whole range of possible changes? Choices harbor an obvious disadvantage compared with ratings when constructing a general utility function—choices contain much less information. Contrast the information in a choice among four items with a rating on each. The choice only tells which is best, whereas the ratings tell that plus information on how much better each is compared with the others. 
This power of ratings is illustrated by the fact that a group of 16 ratings constructed with an efficient factorial design offers substantial information on the utility of five attributes each with four levels. By contrast, one choice out of 16 indicates which is best and little more.

Not only are choices relatively uninformative, modeling them can be problematic. Choice probabilities are constrained to be greater than zero and less than one, making linear models inappropriate and misleading. Maximum likelihood methods can provide logically consistent models but, until recently, have not been able to deal with the critical issue of variability in individual tastes. I will first review the evolution of maximum likelihood models and then move to the far more perplexing issue of heterogeneity.

ADAPTING OLS TO CHOICE

Historically, early choice modelers attempted to solve the estimation problem by building on the model on which most of us were weaned, ordinary least squares. The strategy was to treat choice probabilities as if they were continuous and defined over the real line. This "linear probability model" suffered both estimation and conceptual flaws. In terms of estimation, it flagrantly defied the constant-variance (homoscedasticity) assumption of OLS, but worse, it permitted probability estimates that were less than zero or greater than one. Conceptually, its linearity posed a second, and for marketers more vexing, problem: it assumed a link between market actions and share estimates that almost everyone knew was wrong. Regression assumes that the relationship between share estimates and market efforts is linear, whereas in fact it needs to follow an s-shaped or sigmoid distribution.

To understand why a sigmoid shape is important, consider the following thought experiment. Suppose three children's drinks reside in different categories, one drink with 5%, the second with 50%, and the third with 95% share of its market. Which drink is most likely to gain from the addition of a feature, such as a price cut or a tie-in to Jurassic Park? The answer is generally the one with the moderate share; the brand with 5% share has not yet developed a critical mass, and the one with 95% share has no place to grow. The linear model assumes that the share boost from the feature is the same (say 5%) regardless of its original share, something few of us would expect.

The solution was to transform the choice probabilities so that they would follow an s-shaped curve, as shown in Figure 1. Historically, various functional forms were used, but most focus was placed on the logistic transformation and the cumulative distribution of the normal curve. The normal ogive makes the most theoretical sense, since under the central limit theorem it approximates randomness arising from the aggregation of independent events. However, since the logit is easier to estimate and is indistinguishable from the normal distribution except for its slightly heavier tails, logit came into common use. The logistic transformation for the binary choice takes a very simple form. It says the probability of choosing an item is transformed into utility by

Ux = ln(px / (1 - px)).     (1)

Solving for px provides the familiar expression for probability,

px = 1 / (1 + e^(-Ux)),     (2)

which when graphed looks just like Figure 1.

[Figure 1. Typical sigmoid curve: choice probability (0 to 1) as a function of marketing effort.]
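A quick numerical check of equations (1) and (2), with arbitrary utility values, also shows the sigmoid pattern at work: equal utility increments produce large share gains near a probability of .5 and very small gains near the extremes.

import math

def logit(p):
    """Equation (1): utility as the log-odds of the choice probability."""
    return math.log(p / (1 - p))

def inv_logit(u):
    """Equation (2): choice probability implied by a utility."""
    return 1 / (1 + math.exp(-u))

# Round-trip check over a range of utilities.
for u in (-4, -2, 0, 2, 4):
    p = inv_logit(u)
    print(u, round(p, 3), round(logit(p), 3))
# -4 0.018 -4.0
# -2 0.119 -2.0
#  0 0.5    0.0
#  2 0.881  2.0
#  4 0.982  4.0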
More significantly, if we take the first derivative of probability with respect to U_x, we get

\frac{dp_x}{dU_x} = p_x(1 - p_x), \quad (3)

which, when graphed, looks like Figure 2, having a maximum at p_x = .5. Thus the marginal impact of any action affecting utility is maximized at moderate probabilities and minimized close to zero or one, just as one would expect.

Figure 2: Marginal Value of Incremental Effort (dp_x/dU_x as a function of the original probability of choice)

The logistic transformation solved a conceptual problem by replacing the input choice probabilities with their logits, following Equation (1). However, this clever solution raised what proved to be an intractable implementation problem: logits are not defined for probabilities of zero or one. They approach negative infinity for probabilities approaching zero and positive infinity for those approaching one. This became a problem for the managerial case where probabilities are often zero or one. An obvious solution was to substitute a probability close to zero or one when they occurred, based on the premise that it shouldn't matter as long as the substitute is sufficiently close. Unfortunately, it matters a great deal how close one is. Contrast substituting .9 as opposed to .99999 for 1.0. In the latter case ln(p/(1 - p)) becomes a very large number, exerting a large effect on the solution. Thus the analyst is put in the untenable position of having to make an apparently arbitrary adjustment that strongly changes the results.

Then, about 25 years ago, Daniel McFadden published his seminal paper on conditional logit models of choice (McFadden 1976). While much of what he wrote was derived from other scholars, he was the first to put it all together in one extraordinary paper. His development of the conditional logit model contained three important components:

1. The choice probabilities follow from a random utility framework. Random utility assumes that the utility of each alternative has a defined distribution across the population. The probability of an item being chosen equals the probability that its momentary utility is the best in the set. If these random utilities are distributed as multivariate normal, then probit results. Conditional logit assumes that they are distributed as independent random variables with extreme value (often labeled Weibull) distributions.

2. The conditional logit choice model is estimable using maximum likelihood. Unlike least squares, MLE has no difficulty with choice probabilities of zero or one; one simply estimates the likelihood of such events given the parameters of the model. The article also offered efficient ways to search for the solution by providing the derivatives of the closed form of the likelihood function with respect to the parameters.

3. McFadden worked through the critical statistics. McFadden showed that the estimates are asymptotically consistent, meaning that with enough observations they converge on the true values. He also specified the statistical properties of the model along with estimates of the covariance matrix of the parameters.

Even in hindsight it is difficult to comprehend the importance of this paper. It provided in one document the entire system needed to analyze choice experiments. For this and other econometric work Dan McFadden last year received the Nobel Prize in Economics.
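As a small illustration of the machinery McFadden provided, the sketch below (my own illustration, not code from this talk) computes conditional logit choice probabilities and the log-likelihood of a handful of observed choices for a trial part-worth vector. A maximum likelihood routine simply searches for the part-worths that make this log-likelihood as large as possible, and observed shares of zero or one cause no difficulty because each observed choice contributes the log of a strictly positive probability. The design matrices, chosen alternatives, and part-worth values are made up for the example.

```python
import numpy as np

def mnl_probabilities(X, beta):
    """Conditional logit probabilities for one choice set.
    X: (alternatives x attributes) design matrix; beta: part-worth vector."""
    v = X @ beta                      # deterministic utilities
    v = v - v.max()                   # stabilize the exponentials
    expv = np.exp(v)
    return expv / expv.sum()

def log_likelihood(choice_sets, choices, beta):
    """Sum of log choice probabilities over the observed choices."""
    ll = 0.0
    for X, chosen in zip(choice_sets, choices):
        ll += np.log(mnl_probabilities(X, beta)[chosen])
    return ll

# Two hypothetical choice sets, three alternatives each, two attributes.
sets = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
        np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])]
observed = [2, 1]                     # index of the chosen alternative in each set
beta = np.array([0.8, 0.3])           # trial part-worths

print(mnl_probabilities(sets[0], beta))
print("log-likelihood:", log_likelihood(sets, observed, beta))
```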
From a marketing research perspective the next critical event in choice modeling occurred eight years later, when Jordan Louviere and George Woodworth (1983) published their revolutionary application of discrete choice theory to market research. Whereas McFadden's work was applied largely to actual market and travel choices, Louviere and Woodworth took that theory and applied it to experimentally designed choice sets. They thus offered the advantages of choices as the dependent variable combined with the control and flexibility of choice experiments designed by the researcher. From conjoint they borrowed the idea of orthogonal arrays and stimulated the use of partially balanced incomplete block designs for choice experiments. Coming five years after Green and Srinivasan's (1978) classic review of issues in conjoint analysis, Louviere and Woodworth showed how choice experiments using the new conditional logit models could be applied to solve managerial problems that had heretofore relied on ratings-based conjoint analysis.

That was almost 20 years ago, and despite its promise choice-based conjoint was quite slow in becoming accepted. One of the reasons was that its major competitor, ratings-based conjoint, was itself evolving. These improvements, stimulated largely by Paul Green, Rich Johnson and even Jordan Louviere, made ratings-based methods less vulnerable to competition from the new, elegant upstart. Additionally, the early choice models had a flaw that made them less useful than they might otherwise have been, a flaw I like to call being hit by a red/blue bus.

HIT BY A RED/BLUE BUS

This problem with early choice models arose from a property of multinomial logit readily acknowledged by both McFadden and Louviere. Logit assumes IIA, or independence from irrelevant alternatives, also known as the red bus, blue bus problem.

The red bus, blue bus problem is simple and well known. Suppose preferences in a population are split equally between a red bus and a car, each with a 50% choice share. What happens if the choice set is expanded by a second bus, equivalent to the red bus except for its blue color? One would expect that the two buses would then split the 50% share, leaving the car at 50%. Not so under logit, however, which requires that the new blue bus take share equally from the red bus and the car, resulting in a 33% share for each. The logit adjustment is known as the principle of proportionality, whereby a new alternative takes from current ones in proportion to their original shares. It works fine as a first approximation, but it produces the counterintuitive result just described for the red/blue bus problem, and it can have disastrous implications for many managerial decisions.

The problem of IIA in managerial choice models becomes more apparent if one considers the ways managers want to use conjoint. They use conjoint to:

• Provide demand estimates for a new or revised offering
• Determine the impact of a new offering on current ones and on competitors
• Optimize a product line given a company's cost structure

All of these uses depend on the differential substitutability of options. Just as the red bus is much more substitutable with the blue bus than with a car, so are brands sharing the same feature, family brand, or price tier.
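The red/blue bus arithmetic is easy to reproduce. The sketch below is an added illustration with invented utility values: an aggregate logit simulator splits share proportionally when a clone of the red bus is added, whereas a two-segment version (half the population preferring buses, half preferring cars) lets the blue bus draw share almost entirely from the red bus.

```python
import numpy as np

def logit_shares(utilities):
    """Multinomial logit shares for a single homogeneous segment."""
    expu = np.exp(np.array(utilities, dtype=float))
    return expu / expu.sum()

# Aggregate (homogeneous) logit: red bus and car have equal utility.
print(logit_shares([0.0, 0.0]))            # red bus, car -> 50% / 50%
print(logit_shares([0.0, 0.0, 0.0]))       # red bus, car, blue bus -> 33% each (IIA)

# Two segments of equal size with illustrative utilities: bus lovers and car lovers.
segments = {"bus lovers": {"bus": 2.0, "car": -2.0},
            "car lovers": {"bus": -2.0, "car": 2.0}}

def segment_shares(alternatives):
    """Average the within-segment logit shares across equally weighted segments."""
    total = np.zeros(len(alternatives))
    for prefs in segments.values():
        total += logit_shares([prefs[kind] for kind in alternatives])
    return total / len(segments)

print(segment_shares(["bus", "car"]))          # about 50% / 50%
print(segment_shares(["bus", "car", "bus"]))   # the two buses split the bus share;
                                               # the car keeps close to half
```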
Ignoring differential substitutability can lead to a consistent bias where one underestimates the degree that a new or repositioned brand takes from a company’s current brands, and overestimates the share that will be taken from dissimilar brands. Particularly in product line decisions, just as one must consider joint cost in minimizing the production cost, so one must consider differential joint demand in accurately estimating sales. A major driver of differential substitutability is differences in values across the population, and the problem can be limited if the model takes into account these differences. In the red/blue bus example, the problem goes away if half the sample is modeled as preferring buses to cars, so that the blue bus cuts only into the bus share. Generally, a simulator with individual utilities for each person has little trouble expressing differential substitutability. In ratings-based conjoint each respondent has a unique utility function and a simulation works much as one would hope. The problem is that, unlike ratings-based conjoint, the early conditional logit models assumed homogeneity across respondents. It turns out that choice methods that account for variation in populations, just like species that better manage diversity in their genetic pools, are best able to survive. The choice models in the mid ‘80s and early ‘90s were not adept at handling variation in preferences, and this limited their pragmatic usefulness. They could provide an overall picture of a market, but could not offer critical descriptive or segment-based detail. The rest of this paper reviews four competing methods proposed to enable choice models to account for heterogeneity: (1) including customer 218 2001 Sawtooth Software Conference Proceedings: Sequim, WA. characteristics in the utility function, (2) latent class, (3) random parameters, and (4) hierarchical Bayes. 1. Including customer characteristics in the utility function. McFadden’s original conditional logit model includes a provision for cross terms that link utility for attributes with customer characteristics. Inclusion of such terms in a market model can account for heterogeneous tastes. For example, the analyst might include terms for say, the value of price given as a function of six age-income categories. Once the logit equation estimates these cross terms, then market share estimation proceeds by a process called sample enumeration. That is, one first estimates shares within each distinct group and then aggregates those estimated shares. The aggregated share estimates do not display the IIA restriction, even if those within each group do. Items that are liked by the same or similar groups will take share from each other, and approximate differential substitution will be revealed. Thus, by including customer characteristics one can in principle avoid a collision with the red/blue bus. However, two problems remained with this solution, the first merely inconvenient, but the second devastating. The first problem is that there can be many parameters in such a model. If there are 20 parameter and 12 customer characteristics that affect those parameters, then 240 parameters populate the fully crossed model. This many parameters can result in a difficult modeling task and runs the risk of producing an overfitted model that registers noise instead of signal. The second problem relates to the characteristics of most marketing choices. 
Particularly at a brand level, the correspondence between measured customer characteristics and choice is poor, with a typical R2 of less than 10%. The reason for the lack of correspondence is that many decisions, particularly at the level of a brand name or version, reflect accidents in a consumer’s history rather than deterministic components. Customer characteristics may do a good job of predicting how much soda a person will consume, but they are notoriously poor at specifying the particular brand or flavor. As a result of conditioning on factors that bear too little relation to choice, the conditional logit model did a poor job representing heterogeneity in marketing contexts. What was needed was a way to estimate variability in tastes directly on the choices, something achieved by our next three models. 2. Latent class. A latent class logit model assumes that heterogeneity across respondents can be characterized by a number of mass points, reflecting segments assumed to have identical taste parameters (Kamakura and Russell 1989, Wedell et. al 1999). By having enough (say 10) mass points it should be possible to approximate most preference distributions. The latent class program provides the weight for each mass point and its parameters. Estimation of choices shares then involves simply aggregating shares for each segment weighted by its mass. Because the latent segments tend to be highly variable they can both provide an understandable characterization of the market and result in appropriate choice shares reflecting differential substitution among alternatives. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 219 Four years ago Sawtooth Software produced a promising version of latent class, called ICE (Johnson 1997). This method took the latent class solution and built a model for each respondent as a mixture of the different classes. Thus, one person might have a weight of .5 on class 1, a weight of -1 on class 2 and a weight of 1.2 on class 3. The routine then used these predicted utilities to represent the values of each person. It was then possible to put these individual parameters in a choice simulator and develop predictions for each individual, just like the simulator for ratings-based conjoint. It turned out that this method did not work as well as hierarchal Bayes, described below. In my view the problem was not with the counter-factual nature of the assumptions, since all of our models are approximations. Its fault appears to have been more to do with overfitting when trying to approximate each individual as a mix of mass points. 3. Random parameter logit models. Economists such as Dan McFadden were well aware of problems with the conditional logit model in accurately representing variation in tastes. In response, they expanded the model to include the maximum likelihood estimate of the distribution of parameters over the population (McFadden and Train 2000). That is, the routine outputs not just the mean of the parameters, but also provides estimates of their variances and covariances across the population. Simulations would then randomly sample from the distribution of parameters, and aggregate those estimated shares across the population. This ingenious technique solved the red/blue bus problem and gave economist an efficient and elegant way to characterize complex markets. While the random parameter models reflected an important advance for economists striving to understand aggregate markets, they were less useful to those in marketing for two reasons. 
First, they depend on the form of distribution of preferences assumed. It is not likely that preferences across the population will be normally distributed, but will likely have several segments or regions of higher density. That is not a problem in principle, as different underlying distributions can be tested. However, permitting multiple peaks and complex co-variances increases the number of parameters and complicates the estimation of random parameter models. The second problem is that the random parameter models do not provide utility estimates at the individual level, only estimates of their distribution. Thus, it becomes more difficult to, for example, estimate models for subpopulations. With individual utility functions one can simply group or weight parameters of respondents and then predict their behavior. Random parameter models require that one re-estimate a different model for each group. Thus, from the perspective of a marketing researcher, random parameter models are significantly more cumbersome to implement. HIERARCHICAL BAYES MODELS The final and current contender for modeling choice arose out of a very different research and theoretical tradition. Hierarchical Bayes uses a recursive simulation that jointly estimates the distribution of tastes both across and within individuals. The Bayesian system arose out of a pragmatic tradition that focuses on estimation and distrusts conventional hypothesis tests, 220 2001 Sawtooth Software Conference Proceedings: Sequim, WA. preferring to substitute confidence intervals arising out of a mixture of prior beliefs and sample information. Hierarchical Bayes took the heterogeneous parameter model and modified it both conceptually and in terms of estimation. Conceptually, it viewed the distribution of taste parameters across the population as providing a prior estimate of a person’s values. Those values were combined with a person’s actual choices to generate a posterior distribution of that person’s parameters. In terms of estimation, hierarchical Bayes uses a simulation that generates both the aggregate and the individual distributions simultaneously. It uses the likelihood function not in the sense of maximizing it, but by building distributions where the more likely parameter estimates are better populated. From my perspective the surprising thing about hierarchal Bayes is how well it worked (Sawtooth Software 1999). Tests we ran showed that it could estimate individual level parameters well where respondents only have to respond to as many choice sets as parameters in the model (Arora and Huber 2001). While the HB estimates still contain error, the error appears to cancel across respondents so that estimates of choice shares remained remarkably robust. This robustness stems from two factors. First, hierarchical Bayes is more robust against overfitting. As the number of parameters increase, maximum likelihood tends to find extreme solutions; by contrast, since HB is not concerned with the maximum likelihood, its coefficients reflect a range. By having a less exacting target, a distribution instead of a point estimate, hierarchical Bayes is less susceptible to opportunistically shifting due to a few highly leveraged points. Second, HB is robust against misspecification of the aggregate distribution. Typically the aggregate distribution of parameters is assumed to be normal, but HB individual results tend not to change much depending on that specification, say an inverted gamma or a mixture of normals. 
The reason is that the aggregate distribution serves as a prior, nudging the individual estimates towards its values, but not requiring it to do so. Particularly if there is enough data at the individual level (e.g. at least one choice set per parameter estimated at the individual level), then the individual posterior estimates will not depend greatly on the form of the aggregate taste distribution. For example, if there is a group of respondents who form a focused segment, they will be revealed as a bulge in the histogram of individual posterior means. Thus, hierarchical Bayes is both less dependent on the form of the aggregate distribution and offers a ready test to see how well that form is satisfied. It should be emphasized that the Bayesian aspect of the hierarchical Bayes model is not what makes it so useful for us in marketing research. The same structural model can be built from mixed logit aggregate distribution used to generate distributions of individual parameters given their choices. The resulting individual means are then virtually identical to those using Bayesian estimation techniques (Huber and Train 2001). What makes both techniques so effective is their use of aggregate information to stabilize estimates at the individual level. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 221 CONCLUSIONS In the last 30 minutes I have offered you a quick and oversimplified story of the evolution of choice models in the last 30 years. I hope you will take with you two conclusions, one about the evolution of science and the second about what it takes to usefully model choices in a market. In terms of the evolution of tools for modeling consumers, the message points to the value of an open mind. Ten years ago I would have no more predicted the success of hierarchal Bayes than I would have predicted that mammals would outlive the dinosaurs. We logically tend to be drawn to the models with which we are most familiar; thus it made sense that originally we would try to model choices with least squares regression, only moving to a maximum likelihood when we had to, and then to hierarchal Bayes when it was shown to be so effective. However, it is those models with which we are least familiar that will certainly make the greatest changes in the coming millennium. In terms of useful choice models, the theme here has been the value of modeling heterogeneity in creating appropriate market share predictions. Individual level models provide the most robust and client-friendly method to estimate the impact of changes in offerings on market share. They have an admitted disadvantage; the choice simulator is not a very useful mechanism for helping an analyst understand the dynamics of a market. However, they are the best mechanism we have for predicting complex changes in share. I believe that one of the major reasons rating-based conjoint has been so successful is because its simulations of individual-level model offered an unrivaled perspective on market behavior. Now, with the magic of hierarchical Bayes applied to logit choice models, ratings-based conjoint is not alone with this advantage. Given that choices are better for data collection, respondents and managers, it is an easy prediction that choice-based conjoint will increasingly dominate its ratings-based cousin, at least until challenged by the next successful mutation. 222 2001 Sawtooth Software Conference Proceedings: Sequim, WA. REFERENCES Green, Paul E. and V. 
Srinivasan (1978), "Conjoint Analysis in Consumer Behavior: Issues and Outlook," Journal of Consumer Research, 5 (September), p. 103-123.

Johnson, Richard M. (1997), "Individual Utilities from Choice Data: A New Method," available at www.sawtoothsoftware.com.

Louviere, Jordan, and George Woodworth (1983), "Design and Analysis of Simulated Consumer Choice or Allocation Experiments," Journal of Marketing Research, 20 (November), p. 350-367.

Huber, Joel and Kenneth Train (2001), "On the Similarity of Classical and Bayesian Estimates of Individual Mean Partworths," Marketing Letters, 13(3), p. 257-267.

McFadden, Daniel (1976), "Conditional Logit Analysis of Quantitative Choice Behavior," in Paul Zarembka (ed.), Frontiers of Econometrics, p. 105-142, New York: Academic Press.

McFadden, Daniel and Kenneth Train (2000), "Mixed MNL Models for Discrete Response," Journal of Applied Econometrics, 15(5), p. 447-470.

Sawtooth Software, Inc. (1999), "The CBC/HB Module for Hierarchical Bayes Estimation," available at www.sawtoothsoftware.com.

Thurstone, L. (1927), "A Law of Comparative Judgment," Psychological Review, 34, p. 273-286.

Wedel, Michel, Wagner Kamakura, Neeraj Arora, Albert Bemmaor, Jeongwen Chiang, Terry Elrod, Rich Johnson, Peter Lenk, Scott Neslin and Carsten Stig Poulsen (1999), "Discrete and Continuous Representations of Unobserved Heterogeneity in Choice Modeling," Marketing Letters, 10(3), p. 219-232.

RECOMMENDATIONS FOR VALIDATION OF CHOICE MODELS

Terry Elrod
University of Alberta

INTRODUCTION

How are we to validate choice models? I consider this question in the context of choice-based conjoint, although the theory and solutions offered pertain more generally. This paper points out two mistakes commonly made when validating choice models, explains the consequences of these mistakes, and proposes remedies. I use examples and simulated data to support my claims and demonstrate the efficacy of the proposed solutions.

The following characterization of common practice describes both mistakes. I wonder if you will spot them. We observe conjoint choices (or ratings) and holdout choices for a sample of customers. We fit several different models that represent customer differences in different ways. We evaluate these models by calculating their hit rates for the holdout choices for these respondents and adopt the model with the best hit rate.

Both mistakes are described in the last sentence. They are: (1) hit rates are used to evaluate and choose among models, and (2) the same respondents are used for both model estimation and model validation. Those of you familiar with the conjoint analysis literature will recognize the frequency with which these mistakes are made, and you may even have made them yourself. I must be counted among the guilty.1

You are entitled to doubt that these are indeed mistakes. Most of this paper is intended to convince you that they are. Fortunately, practical remedies are at hand, which I also describe.

Mistake #1: Using Hit Rates to Choose a Model

Hit rates are unreliable and invalid. Hit rates are unreliable because hit-rate differences are small and noisy and because hit rates make inefficient use of the holdout data. Hit rates are invalid because very poor models can have consistently better hit rates than better models. My remedy is to use the loglikelihood of the holdout data rather than hit rates to choose among models.
Hit Rates Are Unreliable

Because hit rates make inefficient use of holdout data, they have a hard time identifying the better model. I illustrate this with a simple example. There are two alternatives in the choice set, A and B, which are chosen equally often.

1 Elrod, Terry, Jordan J. Louviere and Krishnakumar S. Davey (1992), "An Empirical Comparison of Ratings-Based and Choice-Based Conjoint Models," Journal of Marketing Research, 29 (August), 368-77.

Let's allow the better of the two models to have an expected hit rate of 80 percent. On any choice occasion, a choice of either A or B may be observed, and the model may predict either A or B. Suppose the probabilities of observed and predicted choices for a single occasion using this better model are as shown in Table 1.

Table 1: Expected Predictive Performance of the Better Model

                  Predict A    Predict B
    Observe A       0.40         0.10       0.50
    Observe B       0.10         0.40       0.50
                    0.50         0.50

Let's also suppose the second model has a much worse expected hit rate of 70 percent, as shown in Table 2.

Table 2: Expected Predictive Performance of the Worse Model

                  Predict A    Predict B
    Observe A       0.35         0.15       0.50
    Observe B       0.15         0.35       0.50
                    0.50         0.50

The last column and bottom row of each of these tables give the marginal probabilities of the observed and predicted choices, respectively. Note that both models predict that the two alternatives will be chosen equally often, which agrees with their true choice probabilities. This is by design; I consider the validity of hit rates in Section 1.3. Here the model with the higher expected hit rate is clearly the better model, and we are considering only the reliability of hit rates as a means for model comparison. That is, how often will the better model have a higher hit rate? The answer depends upon two things: the number of holdout choices, and the degree of dependence in the predictions of the two models.

Case 1a: Independence in Model Predictions

First we will examine the simpler case of no dependence between the two models in their predictions. More precisely, given knowledge of the actual choice, we assume that knowing the prediction of one model does not help us guess the prediction of the other model. (We will consider the dependent case in Section 1.1.2.) The joint probabilities for the two models hitting or missing on any single choice prediction are given in Table 3. The table reveals that the probability of the better model hitting and the worse model missing on any single occasion is only 0.24. This is also the probability that the hit rate criterion will identify the better model from any single holdout choice. There is also a probability of 0.14 that the worse model will be identified as better (it hits and the better one misses), and there is a probability of 0.62 that the two models will tie (either both hit or both miss).

Table 3: Hit/Miss Probabilities for the Two Models, Independent Case

                            Worse Model
    Better Model        Hit       Miss
    Hit                 0.56      0.24      0.80
    Miss                0.14      0.06      0.20
                        0.70      0.30

What is the probability that the better model will have a higher hit rate than the worse model given more than one holdout choice? The answer to this question for up to 20 holdout choices is shown in Figure 1. We see from the figure that, even with 20 holdout choices, the probability that the better model will be identified by the hit rate criterion is only 0.71.
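The win probabilities plotted in Figure 1 below can be checked directly from Table 3. The following sketch, added here as an illustration rather than taken from the paper, simulates n holdout choices with the joint hit/miss probabilities of Table 3 and estimates how often the better model ends up with the strictly higher hit rate.

```python
import random

# Joint hit/miss probabilities for (better, worse) from Table 3, independent case.
outcomes = [((1, 1), 0.56), ((1, 0), 0.24), ((0, 1), 0.14), ((0, 0), 0.06)]

def simulate_win_probability(n_holdouts, n_sims=200_000, seed=1):
    """Estimate P(better model has the strictly higher hit rate) for n holdout choices."""
    rng = random.Random(seed)
    events, weights = zip(*outcomes)
    wins = 0
    for _ in range(n_sims):
        better_hits = worse_hits = 0
        for b, w in rng.choices(events, weights=weights, k=n_holdouts):
            better_hits += b
            worse_hits += w
        wins += better_hits > worse_hits
    return wins / n_sims

for n in (1, 5, 10, 20):
    print(n, round(simulate_win_probability(n), 3))
# With 20 holdout choices the better model wins only about 71% of the time,
# matching the 0.71 reported in the text.
```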
Figure 1: Probabilities That the Better Model Wins/Ties/Loses, Independent Case (probability that the better model wins, ties, or loses, plotted against the number of holdout choices)

Case 1b: Dependence in Model Predictions

Now we will consider the case of dependence between the two models and their predictions. It seems likely that one model is more likely to predict a choice correctly if the other model did. An instance of this form of dependence is given by Table 4. The probability that the worse model will appear better on any choice occasion (that is, that it will hit and the other miss) has dropped from 0.14 in the independent case (Table 3) to 0.07 here, but the probability of a tie has increased. Figure 2 shows us the net effect. The probability of identifying the better model with 20 holdout choices has increased only slightly, to 0.76.

Table 4: Hit/Miss Probabilities for the Two Models, Dependent Case

                            Worse Model
    Better Model        Hit       Miss
    Hit                 0.63      0.17      0.80
    Miss                0.07      0.13      0.20
                        0.70      0.30

Figure 2: Probabilities That the Better Model Wins/Ties/Loses, Dependent Case (probability that the better model wins, ties, or loses, plotted against the number of holdout choices)

Holdout Loglikelihood: The Reliable Alternative to Hit Rates

What's more reliable than hit rates? The likelihood (or equivalently, the loglikelihood) of the holdout data. The likelihood criterion has been used to fit models for the last 70 years. Maximum likelihood estimation, as its name suggests, seeks values for model parameters that maximize the likelihood of the data. The increasingly popular Bayesian methods for model estimation also depend heavily upon the likelihood. Bayesian estimates are based upon the posterior distributions of parameters. Posterior distributions are always proportional to the likelihood times the prior distribution. Since prior distributions are typically chosen to be uninformative (in order not to prejudice the analysis), the posterior distributions are determined almost entirely by the likelihood. Holdout loglikelihood makes efficient use of holdout data, making it more reliable than hit rates for model selection.

Holdout Loglikelihood Defined

The formula for holdout loglikelihood is given by:

HLL = \ln\left( \prod_i \int [y_i \mid \beta]\,[\beta]\, d\beta \right), \quad (1)

where i indexes the respondent, y_i is a vector of his or her holdout choices, β is a vector of part-worths, [·] denotes a probability density, and [·|·] denotes a conditional probability density.

The formula of Equation 1 can be understood as follows. Our model tells us the probability of observing holdout choices y_i for the i-th respondent given knowledge of his/her part-worths. (This conditional probability is denoted [y_i | β].) Our model also specifies a distribution of the part-worths over the population of consumers (denoted [β]). The likelihood of the i-th respondent's holdout choices y_i according to our model is therefore given by:

HL_i = \int [y_i \mid \beta]\,[\beta]\, d\beta, \quad (2)

and the holdout loglikelihood for all respondents is given by Equation 3, which is equivalent to Equation 1:

HLL = \sum_i \ln(HL_i). \quad (3)

A Note on Model Deviance and Likelihood

The results of model comparisons are sometimes given in terms of deviance rather than loglikelihood.
Deviance, as its name suggests, is a measure of the extent to which the data deviate from what is predicted by a model. The relationship between holdout deviance (which we will denote as HD) and holdout loglikelihood (HLL of Equation 3) is simply

HD = -2\,HLL + c, \quad (4)

where c is an arbitrary constant that is the same for all models fit to the same data. Thus deviance is a measure of model inadequacy that is equivalent to the loglikelihood, and a lower deviance indicates a better model.

Holdout loglikelihood (and holdout deviance) for some models can only be approximated. This is because the calculation of the holdout likelihood for each respondent can involve evaluating a mathematically intractable integral (cf. Equation 2).

Demonstrating the Reliability of Holdout Loglikelihood

We will create two distributions of predicted choice probabilities that might account for the two models' predictions of Cases 1a and 1b. We didn't need to specify these distributions when discussing hit rates because the hit rate criterion considers only which choice was predicted and otherwise ignores the probability associated with that prediction. The holdout loglikelihood criterion, on the other hand, makes use of the probabilities associated with the choice predictions to help distinguish better models from worse ones.

Figure 3 shows, for models Better and Worse, the cumulative distributions of predicted probabilities for chosen alternatives.2 The height of the curve for the worse model at the value 0.5 is equal to 0.3, which means that 30 percent of the time the worse model produces a predicted probability for the chosen alternative that is less than one-half. (These predictions correspond to "misses.") A glance at the curve for the better model confirms that its distribution implies a hit rate of 80 percent.

Figure 3: Distributions of Prediction Probabilities for the Two Models (cumulative distributions of the predicted probability of the observed choice for the Better and Worse models)

Efficiency comparisons for the independent case (Case 1a) are shown in Table 5. The first column shows different probabilities of correctly identifying the better model. The second column shows the expected number of holdout choices needed to attain this probability using the holdout loglikelihood criterion. The third column shows the expected number of holdout choices needed using the hit rate criterion. Notice that the numbers in the hit rate column are always much larger, reflecting the comparative inefficiency of hit rates. The final column shows the efficiency of the hit rate criterion, which is simply the second column divided by the third.

Efficiency comparisons for the dependent case (Case 1b) are shown in Table 6. You may recall from Section 1.1.2 that hit rates were somewhat more reliable in this more realistic scenario. This increase in reliability is reflected in the larger probabilities of identifying the better model shown in column one of Table 6 relative to those of Table 5. However, in the dependent case the reliability of the loglikelihood criterion is improved even more, and hence the efficiency of hit rates is even worse, as can be seen in the fourth column of Table 6.

2 Beta distributions were used, with parameters (1.125, .375) and (.991, .509), respectively. The Beta parameters sum to the same value (1.5) for both models, which holds the degree of consumer heterogeneity constant.
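Before turning to the tabulated comparisons, here is a rough simulation of the efficiency gap, added as an illustration rather than taken from the paper. It draws each model's predicted probability of the chosen alternative from the Beta distributions of footnote 2, treats the two models' predictions as independent, and compares how often each criterion picks the better model. Because of the independence assumption it will not reproduce Tables 5 and 6 exactly, but it shows the direction of the gap.

```python
import numpy as np

rng = np.random.default_rng(7)

# Beta parameters from footnote 2 for the predicted probability of the chosen alternative.
BETTER = (1.125, 0.375)
WORSE = (0.991, 0.509)

def pick_better(n_holdouts, n_sims=100_000):
    """Fraction of simulated samples in which each criterion strictly prefers
    the better model, given n_holdouts holdout choices."""
    p_better = rng.beta(*BETTER, size=(n_sims, n_holdouts))
    p_worse = rng.beta(*WORSE, size=(n_sims, n_holdouts))
    hitrate_wins = ((p_better > 0.5).sum(axis=1) > (p_worse > 0.5).sum(axis=1)).mean()
    loglike_wins = (np.log(p_better).sum(axis=1) > np.log(p_worse).sum(axis=1)).mean()
    return hitrate_wins, loglike_wins

for n in (2, 5, 10, 20):
    hr, ll = pick_better(n)
    print(f"{n:2d} holdout choices: hit rate picks better {hr:.2f}, "
          f"holdout loglikelihood picks better {ll:.2f}")
```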
Table 5: Efficiency Comparisons for the Independent Case

    Probability of Identifying   Number of Holdout Choices Required    Hit Rate
    the Better Model             Loglikelihood       Hit Rate          Efficiency (%)
    .60                                2                 10                 20
    .65                                4                 14                 29
    .70                                7                 19                 37
    .75                               11                 26                 42
    .80                               17                 36                 47

Table 6: Efficiency Comparisons for the Dependent Case

    Probability of Identifying   Number of Holdout Choices Required    Hit Rate
    the Better Model             Loglikelihood       Hit Rate          Efficiency (%)
    .65                                1                 12                  8
    .70                                3                 15                 20
    .75                                5                 19                 26
    .80                                7                 26                 27
    .85                               11                 34                 32
    .90                               16                 47                 34

Why are hit rates so unreliable? Hit rates are unreliable because they throw away information about model performance by discretizing probabilistic predictions. For example, for two-brand choice all predicted probabilities of the chosen alternatives greater than 0.5 are lumped into the category "hit" and all probabilities less than 0.5 are lumped into the category "miss." Throwing away so much information about the models' predictions greatly increases our risk of choosing an inferior model.

Hit Rates Are Invalid

A measure (or criterion) is unreliable if it is equal to the true entity being measured on average but rarely provides an accurate measure due to the presence of a lot of random error. In Section 1.1 the unreliability of hit rates was established, explained, and demonstrated. The superior reliability of holdout loglikelihood was demonstrated in Section 1.2.

Worse than being unreliable, a measure is invalid if it fails to measure what it is intended to measure even on average. I argue in this section that hit rates are invalid measures of model performance. I base this claim on the fact that very poor models can perform as well as good ones. This is easily shown by means of an example. Consider Table 7, which compares two models with identical hit rates. Notice the difference in the marginal probabilities. The better model predicts that B will be chosen 30% of the time, which is correct, whereas the worse model predicts that it will be chosen only 10% of the time. Since the theoretical hit rates are identical for these two models, no amount of holdout data will allow determination of the better model using the hit rate criterion. Each model has an equal chance of obtaining the better hit rate for the data in hand.

Table 7: Case 2: Two Models With Identical Hit Rates

                                       Observed Choice
                    Predicted Choice      A         B
    Better Model           A            0.55      0.15      0.70
                           B            0.15      0.15      0.30
    Worse Model            A            0.65      0.25      0.90
                           B            0.05      0.05      0.10
                                        0.70      0.30

This failure of the hit rate criterion has long been recognized. The common "remedy" is to also measure agreement between the observed and predicted marginal probabilities (i.e., choice shares), using a measure such as the mean square error of prediction. This solution is frequently inadequate, however, because often the model with the best hit rate is not the same as the model that fits the choice shares best. When this occurs, the researcher must still choose the one best model, and there is no theoretical basis for deciding how to combine the two different performance measures into a single measure of model adequacy.

There is a simple remedy to this problem, however, which is the same as the remedy for unreliability: use the holdout loglikelihood. It is not only more reliable than hit rates; it is also valid. Holdout loglikelihood optimally weights aggregate and individual-level accuracy. (This is why we can use this single criterion to estimate our models.)
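The numbers in Table 7 can be verified directly. The sketch below is a simple check added for illustration (not the author's code): it computes each model's theoretical hit rate and its predicted choice shares from the joint probabilities, showing that the hit rates are identical even though only the better model recovers the true 70/30 shares.

```python
# Joint probabilities P(predicted, observed) from Table 7, order (predicted, observed).
better = {("A", "A"): 0.55, ("A", "B"): 0.15,
          ("B", "A"): 0.15, ("B", "B"): 0.15}
worse = {("A", "A"): 0.65, ("A", "B"): 0.25,
         ("B", "A"): 0.05, ("B", "B"): 0.05}

def hit_rate(joint):
    """Probability that the predicted and observed choices agree."""
    return sum(p for (pred, obs), p in joint.items() if pred == obs)

def predicted_shares(joint):
    """Marginal probability of predicting each alternative."""
    shares = {"A": 0.0, "B": 0.0}
    for (pred, _), p in joint.items():
        shares[pred] += p
    return shares

observed_shares = {"A": 0.70, "B": 0.30}

for name, joint in (("better", better), ("worse", worse)):
    print(name, "hit rate:", round(hit_rate(joint), 2),
          "predicted shares:", predicted_shares(joint))
print("observed shares:", observed_shares)
# Both hit rates equal 0.70, but only the better model's predicted shares
# match the observed 0.70 / 0.30 split.
```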
The holdout loglikelihood has no trouble identifying the better of the two models of Case 2 shown in Table 7. MISTAKE #2: VALIDATING MODELS USING ESTIMATION RESPONDENTS Estimating and validating models on the same respondents introduces predictable biases in model selection. It causes us to over fit customer heterogeneity. That is, it biases the model selection process in favor of models that estimate a lot of extra parameters in order to describe customer differences. Fortunately, a good alternative to this practice is also at hand: Validate models using holdout profiles for holdout respondents. The Intuition in a Nutshell The intuition behind the need to use holdout respondents as well as holdout profiles is easy to convey because most of us understand intuitively why a model needs to be validated with holdout profiles rather than with the same profiles used to estimate the model. The same underlying reason explains why a model ought to be validated using holdout respondents. In customer research we have a sample of customers evaluate a sample of profiles. Typically it is not possible to include all profiles of interest in this study. (If it were, we would be doing 232 2001 Sawtooth Software Conference Proceedings: Sequim, WA. concept testing and not conjoint analysis.) Therefore our model needs to generalize beyond the profiles included in the study to the much larger set of profiles that can be generated by combining all levels of all attributes. Similarly, it is rarely possible to include in our study all customers of interest, and our model needs to generalize to the population of customers from which our sample was taken. Thus we want our conjoint model to generalize beyond our sample to other respondents and other profiles. I show in the remainder of Section 2 that this implies we must validate our model using different profiles and different respondents than were used in estimation. My theoretical arguments are backed up with an analysis of simulated data in Section 3. The Principle of Model Validation What I refer to as the principle of model validation is simply stated, and it applies equally to profiles and to customers. The Principle of Model Validation. If a model needs to generalize from a sample to the population from which the sample was taken, the model must be validated using different portions of the sample for validation and estimation. I expect all of us are familiar with what happens when we use goodness-of-fit to estimation data as our model selection criterion—as we add more and more parameters to our model, the fit gets better and better. Here are some examples: • A regression’s R-square keeps increasing as we add more and more independent variables. • A brand-choice model fits the data better when we add brand-specific dummy variables. • A segmentation analysis fits the data better and better as we divide the sample into more and more segments. • A ratings-based conjoint model fits the estimation data best when we estimate a very general model separately for each respondent. In all cases, using goodness-of-fit to estimation data as our model selection criterion favors models that “overfit the data.” Such models estimate a lot of extra parameters to fit the very specific characteristics of the sample data. While the fit to the sample data is excellent, the model predicts poorly for new cases because estimating all those extra unnecessary parameters results in unstable estimates of all parameters. 
Most of us know that this tendency towards over fitting can be avoided by using as our model selection criterion goodness-of-fit to holdout data—that is, to data not used to estimate the model. We have assumed that, as long as different data are used for estimation and validation, the 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 233 principle of model validation has been satisfied. I show in the next section that this assumption is often wrong, and I illustrate and explain proper application of the principle of model validation. Usual and Better Validation Methods Table 8 portrays the common way of assessing predictive validity for conjoint models. This is the first of six tables which all portray validation methods by the same means. The entire data set at our disposal is represented by all the cells in the table excluding the first row and the first column (which simply contain the words “Customers” and “Profiles”). In the case of Table 8, the entire data set is represented by the last two cells in the bottom row, “Estimation” and “Validation.” Table 8: The Usual Design for Model Validation Profiles Customers Estimation Validation The fact that the Estimation cell spans all the Customer rows (there is only one) but not all of the Profile columns signifies that the estimation data consist of evaluations of only a portion of the profiles by all respondents. The Validation cell shows that the remaining portion of profile evaluations for all respondents constitutes the validation data. Thus all data are used, yet different data are used for model estimation and model validation. So it would appear that the principle of model validation is satisfied. Not so. We have taken our sample from what are referred to as two domains: Customers and Profiles. We would like our model to generalize out-of-sample for both of these domains. Therefore the principle of model validation must hold for each of these domains. The usual model validation method is valid for the profile domain but not for the customer domain. The effect of this error is predictable. It will lead us to over fit inter-customer variability in the data, but not inter-profile variability. An example of a proper validation design for conjoint studies is shown in Table 9. Model validation is performed on different profiles and different respondents. Table 9 shows that the model is assessed on its ability to predict to respondents and profiles not used in estimation. Table 9: A Proper Design for Model Validation Profiles Estimation Customers 234 Validation 2001 Sawtooth Software Conference Proceedings: Sequim, WA. V-fold Cross-Validation A shortcoming of the proper validation method portrayed in Table 9 is that some data are wasted—those observations represented by blank cells in the table. However we can easily avoid wasting respondents (and profiles) by making cyclical use of both for validation. Table 10 illustrates what is known as two-fold validation, which uses all the data at our disposal but also satisfies the principle of model validation. This two-fold validation design would be implemented as follows. First the model is estimated on customers/profiles indicated by Estimation1 and its loglikelihood is calculated for customers/profiles Validation1. The same model is then estimated on Estimation2 and its loglikelihood calculated for Validation2. The performance of the model is given by the sum of its validation scores on Validation1 and Validation2. 
Table 10: Two-fold Validation Profiles Customers Estimation1 Estimation2 Validation2 Validation1 One might ask: Since the model is estimated twice, which model estimate is used, Estimation1 or Estimation2? The best answer is: Neither. Model validation is used to select the best model. Once the best model is determined, it should be estimated on the entire data set in order to obtain the most precise estimates possible. A Problem With Two-fold Validation, and a Partial Remedy There is a problem with two-fold validation. Assuming that the profiles and respondents are split in half, then Table 10 shows that models are assessed on their ability to generalize to new respondents and profiles when estimated on only one-fourth of the data at a time. In general, the best model to estimate for the entire data set will be more complex than the best model estimated on only one-fourth of the data. More data allow estimation of more parameters with sufficient reliability to generalize well. This difficulty can be partially remedied, but not entirely. The remedy is to partition respondents and profiles into V parts of equal size and choose a value for V greater than two. Then we estimate the model V different times, each time on (V – 1)/V of the respondents and the same fraction of profiles, and evaluate each fit on its loglikelihood for the remaining 1/V respondents and profiles. The performance of the model is equal to its total loglikelihood over all V estimations of the model. The larger the value we choose for V, the closer we are to evaluating models estimated on the entire data set, but the greater the number of times the model must be estimated. Four-fold validation is illustrated in Table 11. It can be easier to think in terms of which customers and profiles are set aside for validation each of the four times, with it being understood that the other customers/profiles will be used in estimation. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 235 Table 11: Four-fold Validation Profiles V1 V2 Customers V3 V4 In Table 11, Vi indicates the validation customers/profiles for each of the four “folds” i = 1, …, 4. That is, the division of customers and profiles into estimation and validation portions for the first fold is as shown in Table 12. Table 12: Four-fold Validation, First Fold Profiles V1 Customers E1 An Allowable Simplification of V-fold Validation While the four-fold cross-validation scheme of Table 11 possesses an elegant symmetry in its treatment of respondents and profiles, it can be inconvenient in practice. While it is a simple matter to partition respondents into V groups and estimate the model on any V – 1 of these groups, the same cannot be said for the profiles. While it is possible to come up with questionnaire designs consisting of V blocks of profiles such that the model may be estimated reliably on any V – 1 of these blocks, this is not easy to do using standard designs, even when V is prespecified. A simpler alternative to the design of Table 11 is shown in Table 13. There the familiar treatment of profiles in cross-validation is adopted—the profiles are divided into two groups, with some used for estimation and others for validation. As explained in Section 2.1, the problem with common validation practice is how it handles respondents, not with its handling of profiles. 236 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 
Table 13: Four-fold Validation, Modified Profiles Customers V1 V2 V3 V4 Note in Table 13 that the same profiles are used for validation each time. They need not be equal in number to the profiles used in estimation, and in fact they are typically fewer in number. However the model must still be estimated four different times, always on different respondents than those used for validation. The data used to estimate the model that is validated on V1 shown in Table 13 is represented by the bottom three of the four blank cells shown in that table. Full validation of the model would require its being estimated and validated four different times, always using the estimation and validation data as suggested by the table. There are several advantages to the modified cross-validation procedure illustrated in Table 13. The primary advantage is that the problem of experimentally designing the profiles used in estimation is simplified. A second advantage is that the validation procedure is easier to program because we need only concern ourselves with changing the assignments of respondents to estimation and validation, and need not worry simultaneously about reassigning profiles. There are also two disadvantages that warrant mention. First, model selection is more sensitive to choice of holdout profiles because the same profiles are always used for validation. It is still common to give great thought to the design of the estimation profiles but to select the validation profiles in a haphazard manner, even though the efficacy of the cross-validation procedure is sensitive to both decisions. Staying with the old method of choosing validation profiles encourages continuance of poor practice. The second disadvantage is that we still need to estimate the model as many times (V), but with a less reliable result. Much of the drawback to switching profiles used for estimation and validation can be solved by programmers of conjoint software. It is the extra estimation time that is an inevitable consequence of proper and efficient cross-validation, and the modified validation method of Table 13 does not help with this problem at all. Why Not N-fold Validation? N-fold validation involves estimating the model N times, where N is the number of respondents, using a single one of the N respondents as the holdout each time. An important advantage of this technique is that the holdout validation is performed using the maximum number of respondents possible (N – 1) for estimation. The best model estimated using N – 1 respondents cannot be very different from the best model estimated using the entire sample. 2001 Sawtooth Software Conference Proceedings: Sequim, WA. 237 However there are two disadvantages to N-fold validation. First, the model must be estimated N different times, which will usually be at least several hundred. Second, if there are numerous respondents in the sample that appear “aberrant” (as far as a model is concerned), every estimation sample will contain all, or all but one, of them. As a result, a model’s inability to properly account for these “aberrant” respondents cannot be readily discovered.3 Other Tips For Validation Here are a few other suggestions for cross-validation. • Begin by estimating the model on all respondents, and use the result as starting values for the estimates in the validation procedure. • Assign respondents randomly to blocks, not sequentially according to their position in the data file. 
• A V-fold validation may be run any number of times, reallocating respondents to holdout blocks randomly for each run, and the results summed over all runs. (A single "run" of a V-fold validation entails estimating the model V times, each of those times using a different 1/V fraction of the respondents as holdouts.) Thus a second or third validation run can be used when the first run fails to determine the best model with sufficient certainty.

• Validate the model as you would use it. If you will use the model's point estimates for prediction, validate the model using those point estimates. On the other hand, if you will retain uncertainty in the estimates when using the model, validate it this way. You can even validate a model both ways to compare model performance using its point estimates with model performance when you do not. (Although retaining parameter uncertainty in a model is more proper, using point estimates is usually more convenient.)

A Simulation For Investigating Mistake #2

I designed a simulated data set to investigate the common practice of using the same respondents for estimation and validation and to compare it to my recommendation to use different respondents for estimation and validation. A simulation has the important advantage that we know what the correct model is. We are using simulated data to see whether a model validation procedure succeeds or fails to select the true model. Of course, in the real world the true model is unknown to us. However, a model selection procedure that fails to select the true model for simulated data is unlikely to select the best model for our real problems. This is particularly true when, as here, the simulation is simply confirming what we already know from theory.

3 It is for this same reason that the jackknife, which leaves out only one observation, is inferior to the bootstrap. See Efron, Bradley (1982), The Jackknife, the Bootstrap, and Other Resampling Plans, Philadelphia, PA: Society for Industrial and Applied Mathematics.

Data Set Design

I simulated 32 pairwise choices by 256 respondents. Sixteen of the choices were used for estimation and 16 for validation. The two alternatives in each choice set are described in terms of ten binary attributes. The attribute levels were assigned to the 32 pairs of alternatives by using a 2^11 orthogonal array ("oa") design. The last column of the design divided the choice questions into estimation and holdout choice questions.

A nice property of pairwise designs with binary attributes is that they can be generated using ordinary experimental designs for binary attributes. One row of the design suffices to characterize the two alternatives in the choice set. The binary design specifies which of the two levels is assigned to the first alternative in the pair, and the other alternative is assigned the other level for that attribute.

The individual-level model contains ten variables for the ten attributes (main effects only) plus an intercept that provides for a tendency to choose either the first or second alternative in each question. Thus eleven coefficients (part-worths) are needed to describe each individual's choice behavior. These part-worths were made to vary across respondents according to a multivariate normal distribution given by n_{11}(μ, Σ), where

μ' = (0, 0.1, 0.2, …, 1.0) and Σ = diag(1.1, 1.0, 0.9, …, 0.1). \quad (5)
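A data set of this general form is easy to simulate. The sketch below is my own rough illustration of the design just described, not the author's code: a random ±1 matrix stands in for the 2^11 orthogonal array, the attribute coding (0/1 levels, so the contrast between the paired alternatives is ±1) is an assumption, and the first 16 questions are arbitrarily treated as the estimation block.

```python
import numpy as np

rng = np.random.default_rng(0)

N_RESPONDENTS, N_CHOICES, N_ATTRIBUTES = 256, 32, 10

# Part-worth heterogeneity per Equation 5: intercept plus ten attribute effects.
mu = np.linspace(0.0, 1.0, 11)                 # (0, 0.1, ..., 1.0)
cov = np.diag(np.linspace(1.1, 0.1, 11))       # diag(1.1, 1.0, ..., 0.1)
betas = rng.multivariate_normal(mu, cov, size=N_RESPONDENTS)

# Stand-in for the orthogonal array: for each question and attribute, which level
# the first alternative receives (+1) relative to the second (-1).
diff = rng.choice([-1.0, 1.0], size=(N_CHOICES, N_ATTRIBUTES))

# Columns: intercept (tendency to pick the first alternative) plus attribute contrasts.
X = np.hstack([np.ones((N_CHOICES, 1)), diff])

# Binary logit choice of the first alternative in each pair, respondent by respondent.
prob_first = 1.0 / (1.0 + np.exp(-(betas @ X.T)))            # (respondents x choices)
choices = (rng.random(prob_first.shape) < prob_first).astype(int)

estimation, holdout = choices[:, :16], choices[:, 16:]
print(estimation.shape, holdout.shape)
```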
Figure 4 shows the distributions of choice probabilities over the sixteen estimation choice questions.

Figure 4: Distributions of Choice Probabilities for the Estimation Questions (cumulative distributions of the choice probabilities)

The Models Assessed

Three different models were assessed by cross-validation. They all correctly represent the individual-level model, but they differ in their representation of customer heterogeneity. A fourth "model" is used to provide a benchmark when validating using estimation respondents.

MVN.True

Here we use the true values of the parameters of the multivariate normal distribution,

[\beta] = n_{11}(\mu, \Sigma), \quad (6)

where the values for μ and for Σ are as given in Equation 5. Then the holdout likelihood calculation for each holdout individual is given by

HL_i = \int [y_i \mid \beta]\; n_{11}(\mu, \Sigma)\, d\beta. \quad (7)

MVN.Sample

In this model we assume that we know that the true distribution of part-worth heterogeneity is multivariate normal, but we don't know the values for the multivariate normal parameters (μ and Σ). However, since we know from our simulation the true part-worth vectors for our sample of respondents, we can estimate μ and Σ from these. These estimates, based on the sample, are denoted using the usual notation μ̂ and Σ̂. Thus our model of customer heterogeneity is

[\beta] = n_{11}(\hat{\mu}, \hat{\Sigma}), \quad (8)

and the holdout likelihood for the i-th respondent is

HL_i = \int [y_i \mid \beta]\; n_{11}(\hat{\mu}, \hat{\Sigma})\, d\beta. \quad (9)

Ind.Mixture

In this case we assume we also don't know that the population of part-worth vectors has a multivariate normal distribution. Instead we use the true part-worth vectors for our sample of consumers, and let these stand in as our estimate of the true distribution of part-worths in the population. Letting i* = 1, …, N* index the N* respondents used for estimation and {β_{i*}, i* = 1, …, N*} denote the set of part-worth vectors for these respondents, the distribution of part-worths is taken to be

[\beta] = \begin{cases} 1/N^* & \text{if } \beta \in \{\beta_{i^*},\; i^* = 1, \ldots, N^*\} \\ 0 & \text{otherwise,} \end{cases} \quad (10)

and the holdout likelihood for the i-th respondent is

HL_i = (1/N^*) \sum_{i^*} [y_i \mid \beta_{i^*}]. \quad (11)

When validating on estimation respondents, the i-th respondent of Equation 11 will be among the N* estimation respondents, but when validating on holdout respondents he or she will not. This model may seem strange to some readers, but in fact it is used by choice simulators that predict shares based on individual-level part-worth estimates. The part-worths for the sample respondents are used to represent the true distribution of customer heterogeneity.

Ind.True

This final case is for reference only because it cannot be used in cross-validation. No distribution for the part-worths β is specified. We simply use each respondent's true part-worth vector to calculate the holdout likelihood for his or her holdout choices. This provides us with a performance threshold which no model can be expected to beat. The holdout likelihood for the i-th respondent given knowledge of his or her true part-worths is simply

HL_i = [y_i \mid \beta_i]. \quad (12)

Desired and Expected Results

We seek the best model for the population of customers from which our sample of respondents was taken. Our validation procedure ought to identify MVN.True as best, MVN.Sample as second best, and Ind.Mixture as worst.
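The holdout likelihoods behind these comparisons can be approximated numerically. The sketch below is my addition, not the author's code; it shows one way Equations 7, 9, and 11 could be approximated by simulation. It assumes arrays like those produced in the earlier data-generation sketch, and the names y (one respondent's 0/1 holdout choices), X (the holdout intercept-plus-contrast matrix), mu, cov, and sample_betas are all hypothetical.

```python
import numpy as np

def choice_likelihood(y, beta, X):
    """[y | beta]: probability of one respondent's holdout choices given part-worths.
    y holds 0/1 indicators for choosing the first alternative in each pair."""
    p_first = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return float(np.prod(np.where(y == 1, p_first, 1.0 - p_first)))

def hl_mvn(y, X, mu, cov, n_draws=5_000, seed=1):
    """Monte Carlo approximation of Equations 7 and 9: average [y | beta] over
    draws of beta from a multivariate normal taste distribution."""
    draws = np.random.default_rng(seed).multivariate_normal(mu, cov, size=n_draws)
    return np.mean([choice_likelihood(y, b, X) for b in draws])

def hl_mixture(y, X, sample_betas):
    """Equation 11: average [y | beta] over the estimation respondents' part-worths."""
    return np.mean([choice_likelihood(y, b, X) for b in sample_betas])

# The holdout loglikelihood of Equation 3 for a model is sum(log(HL_i)) over
# holdout respondents; holdout deviance is -2 times that sum plus a constant.
```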
Desired and Expected Results

We seek the best model for the population of customers from which our sample of respondents was taken. Our validation procedure ought to identify MVN.True as best, MVN.Sample as second best, and Ind.Mixture as worst.

MVN.True uses the true model of part-worth heterogeneity in the population of consumers, and we cannot do better than the truth. MVN.Sample is not as good a model as MVN.True because it uses estimates of the multivariate normal parameters based on the sample, which is not as good as knowing and using the true values of those parameters. Ind.Mixture is not as good as either MVN.True or MVN.Sample for the simulated data because it does not make use of the fact that the true distribution of heterogeneity is multivariate normal. Of course, for real-world data we will not have MVN.True at our disposal, and Ind.Mixture may be better than MVN.Sample if the true distribution of customer heterogeneity departs substantially from the multivariate normal. That is why we need a valid and reliable tool for identifying the best model of customer heterogeneity for a given data set.

As explained in Section 2, the common practice of using the same respondents for estimation and validation can be expected to lead us to adopt a model that overfits respondent heterogeneity. That is, models with more freedom to (over)fit the heterogeneity in the sample will appear to perform better than they should. Common validation practice should therefore cause us to prefer Ind.Mixture over the other two models, because it has the most freedom to overfit heterogeneity in the sample. It should also cause us to prefer MVN.Sample to MVN.True, because MVN.Sample has some freedom to fit heterogeneity in the sample (unlike MVN.True), though not as much freedom to overfit as Ind.Mixture.

Model Comparison Results When Using Estimation Respondents for Validation

Figure 5 plots the holdout deviance for the four models of Section 3.2 under the common validation practice of using all sampled respondents (but different choice sets) for both estimation and validation.

Figure 5: Model Performance for the Estimation Respondents
[Figure not reproduced: point estimates of holdout deviance (vertical axis, approximately 2800 to 2880), with credibility intervals, for MVN.True, MVN.Sample, Ind.Mixture, and Ind.True.]

The holdout deviance scores (cf. Equations 2–4) are portrayed in the figure as black dots. The holdout deviance for the reference model, Ind.True, can be calculated exactly (Equation 12). Holdout deviance for the other models was approximated by Monte Carlo methods using version 1.3 of WinBUGS (see Footnote 4). Figure 5 also shows the 95% credibility intervals for these approximations; greater accuracy can be attained by running the simulation program longer.

Footnote 4: See Spiegelhalter, D. J., A. Thomas and N. G. Best (1999), WinBUGS Version 1.2 User Manual, MRC Biostatistics Unit. For further information about WinBUGS, consult http://www.mrc-bsu.cam.ac.uk/bugs/.

Recalling that lower deviance scores are better, we see that the true model of customer heterogeneity (MVN.True) is shown to perform worst of all, MVN.Sample is shown to perform better, and better still, according to this procedure, is Ind.Mixture. Thus the bias of using the same respondents for estimation and validation is exactly as we predicted.
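Before turning to the holdout-respondent results, the sketch below indicates how the respondent-level V-fold loop recommended earlier might be organized (the study used V = 4; the particular "modified" design of Table 13 is not reproduced in this excerpt). The `holdout_deviance_fn` argument is a hypothetical callable standing in for whatever routine fits the heterogeneity model on the estimation respondents and scores the holdout respondents, for example one built from the helper functions sketched above.

    import numpy as np

    def v_fold_respondent_validation(n_resp, v, holdout_deviance_fn, n_runs=1, seed=0):
        """Respondent-level V-fold validation: randomly assign respondents to v holdout
        blocks; for each block, treat its respondents as holdouts and the rest as
        estimation respondents; sum the resulting holdout deviances over blocks and,
        optionally, over repeated runs with fresh random assignments.
        `holdout_deviance_fn(est_idx, hold_idx)` fits the heterogeneity model on the
        estimation respondents and returns the holdout deviance for the holdouts."""
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_runs):
            folds = rng.permutation(n_resp) % v          # random block assignment
            for k in range(v):
                hold_idx = np.flatnonzero(folds == k)    # holdout respondents for this fold
                est_idx = np.flatnonzero(folds != k)     # estimation respondents for this fold
                total += holdout_deviance_fn(est_idx, hold_idx)
        return total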
Model Comparison Results When Using Validation Respondents for Validation

Finally, Figure 6 shows the results obtained from the "modified" four-fold validation design of Table 13. The performance of Ind.Mixture is now shown to be much worse than that of the other two models. Although it appeared to be the best of the three under the usual validation practice, it is apparent here that it overfits heterogeneity for the sampled respondents.

Figure 6: Model Performance for the Validation Respondents
[Figure not reproduced: point estimates of holdout deviance (vertical axis, approximately 2900 to 3200), with credibility intervals, for MVN.True, MVN.Sample, and Ind.Mixture.]

The Ind.True reference point cannot be calculated for holdout respondents and is not portrayed: one can never know the individual-level part-worths of respondents who are not in the sample.

The two best models according to Figure 6 are MVN.True and MVN.Sample. MVN.True ought to be the best, but in fact the better of these two models cannot be distinguished with certainty: their 95% credibility intervals overlap considerably. This underscores the need for measures of model performance with maximum power and validity. You may recall from Section 1.1 that the hit rate criterion was criticized for being less reliable than holdout likelihood. We cannot afford to waste information about model performance when choosing among models.

IN CONCLUSION

This paper identifies two mistakes commonly made when validating choice models and proposes two remedies.

• Don't use hit rates to "validate," or choose among, models. Use holdout log-likelihood (or, equivalently, holdout deviance) for improved reliability and validity in model selection.

• Don't "validate" models using estimation respondents. Validate on holdout respondents, so as to avoid models that overfit respondent heterogeneity and therefore predict poorly for customers not in the sample.

It is worth noting that the common practice of using hit rates to validate models presumes that the same respondents are used for estimation and validation: we obtain estimates of each respondent's part-worths and use them to predict that same respondent's holdout choices. As Sections 2 and 3 demonstrate, it is important to validate models on holdout respondents. I see no valid method for validating models using hit rates on holdout respondents and choices, so it appears we have discovered a third reason to avoid using hit rates.
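As a closing illustration of the first recommendation, the short sketch below scores the same set of holdout predictions by hit rate and by holdout log-likelihood. The predicted probabilities and choices are made up purely for illustration; the point is only that the hit rate discards the size of each predicted probability, whereas the log-likelihood uses it, which is one reason the likelihood-based criterion carries more information about model performance.

    import numpy as np

    def hit_rate(p_first, y):
        """Fraction of holdout choices 'hit': the first alternative is predicted
        whenever its probability exceeds 0.5, and a hit occurs when the prediction
        matches the observed choice (y = 1 means the first alternative was chosen)."""
        predicted = (p_first > 0.5).astype(int)
        return np.mean(predicted == y)

    def holdout_loglik(p_first, y):
        """Holdout log-likelihood: sum of log predicted probabilities of the choices
        actually made (holdout deviance is -2 times this quantity)."""
        p_chosen = np.where(y == 1, p_first, 1.0 - p_first)
        return np.sum(np.log(p_chosen))

    # Made-up predictions for five holdout choices from two hypothetical models
    y = np.array([1, 0, 1, 1, 0])
    model_a = np.array([0.55, 0.45, 0.60, 0.55, 0.40])   # barely right, never confident
    model_b = np.array([0.90, 0.10, 0.95, 0.45, 0.15])   # confident and mostly right, one near-miss

    print(hit_rate(model_a, y), hit_rate(model_b, y))              # 1.0 versus 0.8
    print(holdout_loglik(model_a, y), holdout_loglik(model_b, y))  # about -2.82 versus -1.22,
                                                                   # so log-likelihood favors model_b
                                                                   # despite its lower hit rate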