A research and education initiative at the MIT Sloan School of Management Unilever Data Analysis Project Paper 179 Dimitris Bertsimas Adam Mersereau Geetanjali Mittal June 2003 For more information, please visit our website at http://ebusiness.mit.edu or contact the Center directly at ebusiness@mit.edu or 617-253-7054 UNILEVER DATA ANALYSIS PROJECT BY DIMITRIS BERTSIMAS ADAM MERSEREAU GEETANJALI MITTAL Massachusetts Institute of Technology TABLE OF CONTENTS 1 INTRODUCTION 1.1 1 Summary of Data and Unilever's Previous Data Mining Efforts 1 1.1.1 Panel Data 2 1.1.2 The Most Valuable Consumer and Existing Predictive Models 2 1.1.3 Unilever Database 3 1.2 Project Research Directions 3 1.3 Data Provided 3 2 PREDICTION EFFORTS ON ORIGINAL DATA SET 6 2.1 Data Cleaning 6 2.2 Predictive Methods Investigated 8 2.3 Results 9 2.4 Conclusions 11 2.5 Block Structure in Unilever Data Extract 12 2.6 Analysis of Fall 2000 Survey Data 13 3 NON-SURVEY DATA ANALYSIS 14 3.1 Data Details 15 3.2 Data Processing and Transformation 16 3.2.1 Data Cleaning 17 3.2.2 Data Aggregation 17 3.3 Data Issues 18 3.3.1 Conflicts in Computation of Household Layout Variables 18 3.3.2 Unbalanced Brand Representation 19 3.3.3 Insufficient Information on Source of Information in Non–Survey Data 20 3.4 Predictive Modeling 20 3.4.1 Choice of Models 20 3.4.2 Predictive Efforts using Logistic Regression 22 3.5 GQM Score–Based Stratified Analysis 25 3.5.1 Stratified Prediction Models 25 3.5.2 Predicting GQM Strata for Each Consumer 26 ii Massachusetts Institute of Technology 4 CLUSTERING ANALYSIS 26 4.1 Clustering Background 28 4.2 Cluster Analysis Details 28 4.3 Inference from Cluster Analysis 29 4.3.1 Effects of Unbalanced Brand Representation 29 4.3.2 Details of Cluster 1 30 4.3.3 Details of Cluster 2 30 4.3.4 Details of Cluster 3 31 4.5 Cluster–Wise Predictive Modeling 5 PROJECT SUMMARY AND CONCLUSIONS iii Massachusetts Institute of Technology 32 33 TABLE OF FIGURES Figure 1: Graphic Representation of MVC 2 Figure 2: Brands with Top Ten Consumer Responses 6 Figure 3: Consumer Response Distribution for Overall Data 7 Figure 4: Representative Diagram of Unilever Data Structure 7 Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively 10 Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively 10 Figure 7: Coefficients of a Logistic Regression Model for Caress Body Wash 11 Figure 8: Coefficients of a Logistic Regression Model for Classico Pasta Sauce 11 Figure9: Graphical Representation of Unilever Data Structure 12 Figure 10: Lift Curves for Logistic Regression Models Based on Fall 2000 Survey Data 14 Figure 11: Graphical Representation of Non–Survey Data 15 Figure 12: Consumer Response Distribution for Non–Survey Data 16 Figure 13: Category – wise Brand Distribution 16 Figure 14: Household Member Age–Gender–Wise Aggregation 18 Figure 15: Consumer Response Distribution for Some of the Brands 19 Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network 21 Figure 17: Lift Curves for Different Unilever Brands 22 Figure 18: Significant Demographic Parameters 23 Figure 19: Graphical Representation of Logistic Regression Model for Gorton Fillets 23 Figure 20: Logistic Regression Model Coefficients for Gorton Fillets 24 Figure 21: Comparative Lift Curves for Models with and without GQM Scores 24 Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models 25 Figure 23: Cluster Pie Chart 28 Figure 24: Cluster Statistics 28 Figure 25: Importance Value of 
Significant Variables 29 Figure 26: Input Means for Cluster 1 30 Figure 27: Input Means for Cluster 2 31 Figure 28: Input Means for Cluster 3 31 Figure 29: Comparative Lift Curves for Cluster – based and Overall Models 33 iv Massachusetts Institute of Technology 1 INTRODUCTION This report describes and concludes the data analysis project undertaken in collaboration with Unilever through the Sloan School of Management Center for eBusiness. In this document we trace our interactions with Unilever, describe the data made available to us, describe various analyses and results, and present overall conclusions learned in the course of the project. Unilever has been a pioneer in mass-marketing, which focuses on widely broadcast advertising messages. This marketing approach is at odds with new trends towards targeted marketing and CRM (Consumer Relationship Management), and Unilever is interested in investigating how the new marketing philosophy applies to the packaged consumer goods industry in general and to Unilever in particular. As Unilever sells its products not directly to consumers but through a variety of retail channels, they have indirect contact with the end consumers. Thus the application of CRM (defined by Unilever as “Consumer Relationship Management” rather than the more standard “Customer Relationship Management” to make this distinction) is less obvious in their business. In particular, Unilever recognizes three clear obstacles to the application of CRM ideas in the packaged goods industry: • Consumer transaction data is difficult to procure in the packaged goods industry. • Data mining expertise and experience are new to packaged goods companies. • The packaged goods industry marketing efforts focus on brands rather than on consumers. Unilever employs data mining in the area of CRM. This effort is partly undertaken in the Relationship Marketing Innovation Center (RMIC), which is a group that transcends Unilever’s individual brands. RMIC Unilever’s project with MIT has been part of these efforts. Especially in light of the second bullet point above, the MIT team was asked to help evaluate the potential of data mining technology for Consumer targeting at Unilever’s business. Unilever was to make available to the MIT team representative samples of the data at Unilever’s disposal, and the MIT team was to analyze this data and research new data mining methods for making use of this data in a targeted marketing framework. 1.1 Summary of Data and Unilever’s Previous Data Mining Efforts This section summarizes our understanding of the data Unilever has available for analysis, as well as Unilever’s previous data mining efforts with this data. A primary data source is a Unilever database of consumers. The database contains information on individuals and households who have interacted with Unilever in some fashion in the past. The data includes demographic and geographic information as well as self-reported usage and survey information. 1 Massachusetts Institute of Technology 1.1.1 Models Based on information available on a subset of Unilever consumers two models were fit, the so-called Demographic and Golden Question models, for predicting if a consumer is an MVC (“Most Valuable Consumer”) as measured by their dollar spend on Unilever’s collection of brands. The concept of MVC and the two models are described in more detail in the next paragraph. 
1.1.2 The Most Valuable Consumer and Existing Predictive Models Much of Unilever’s data mining efforts prior to August, 2001, were focused on identifying the “most valuable consumers” (MVCs) on a brand- and company-wide level. The MVC for a specific brand is a consumer determined to spend highly on the Unilever brand and on the industry category. Specifically, rank consumers both by their dollar spend on a Unilever brand and by their dollar spend in the corresponding industry category. Individuals are categorized as a “heavy”, “medium”, or “low” brand or category consumer. The MVCs for a given brand are generally defined as those consumers found in the shaded regions of the following table. H M L L M H profitability to Unilever brand Figure 1: Graphic Representation of MVC The concept of MVC can also be extended to the level of the overall company. Unilever’s Demographic and Golden Question models are logistic regression models that estimate the probability that an individual consumer is an MVC in terms of their expenditure to Unilever as a whole. The demographic model uses demographic variables exclusively, while the Golden Question model uses both demographic input as well as a minimal set of survey responses about product usages. These models are used to score and rank individuals in the Unilever database. 1.1.3 Unilever Database The Unilever Database is a large data warehouse owned by Unilever but maintained by Axciom, a data warehousing company. The database includes varying amounts of information on Unilever consumers, compiled from a number of sources. For each consumer, the database potentially reports on: • Demographic data at the individual, household, and geographic block levels • Responses to promotional events 2 Massachusetts Institute of Technology • Survey responses • Predictions of Demographic and Golden Question models The database contains no transactional purchase information, although it does provide self-reported brand usage data, model predictions, and contact history information. The accuracy of the self-reported brand usage data varies by brand. 1.2 Project Research Directions At a project kickoff meeting in Greenwich, CT in August, 2001, we discussed several research directions of interest to both MIT and Unilever: • Investigate methods for predicting, characterizing, and clustering individual consumer usage of products. • Experiment with alternate definitions of MVC. • Develop alternative models to predict MVC. • Develop dynamic logistic regression models—that is, prediction methodologies based on logistic regression that evolve in time. • Develop optimization-based logistic regression subset selection methodologies. Specifically, how can we use such a methodology to design a questionnaire of maximum value and minimum length? Although this list includes a number of items that may be of interest to Unilever in the future, the majority of our efforts were focused on the first of these topics. This decision was largely guided by the data made available to us, which includes no information on consumer profitability and contains limited time stamp information. In subsequent sections of the report, we will revisit the data limitations and the role of the data in guiding our analysis. 1.3 Data Provided We received several data files from Unilever on a CD dated November 30, 2001. The files and our understanding of them are as follows: • “UNIFORM.TXT”: A large data file sampled from the Unilever Database. 
The extraction was performed via uniform random sampling from the most complete and reliable records from the database. • “STRATIFIED.TXT”: A large data file sampled from the Unilever Database. This file is similar to UNIFORM.TXT except in the method of sampling. Stratified sampling was used and stratification was done with respect to the Golden Question model scores. • “MIT LAYOUT #1 AXCIOM.XLS”: A list of variable names and layout for the data in UNIFORM.TXT and STRATIFIED.TXT. 3 Massachusetts Institute of Technology • “LAYOUT DESCRIPTION #1.XLS”: A list of variable names along with a brief description of many of the variables. • “MIT LAYOUT #2 INFOBASE.XLS”: An enumeration of InfoBase data entries, which form some of the variables in UNIFORM.TXT and STRATIFIED.TXT. Many of the variables described in MIT LAYOUT #2 INFOBASE.XLS do not appear in the data files. • “DATA DICTIONARY JUNE 2001.DOC”: An enumeration of InfoBase variables, only some of which are included in the data files. • “BRAND ID&NAME.XLS”: A table matching brand ids to brand names. • “MARKET.TXT”: A table matching Market Codes to county names. • “VIC’S REPLY.TXT”: Some detail clarifications on LAYOUT DESCRIPTION #1.XLS. We were subsequently given the following file: • “MASTER.XLS”: A table matching brand ids to brand names, franchises, and product categories. We were provided no other description or information regarding the data. The UNIFORM.TXT data file has a raw size of 114 Mb, and includes 367 variables for 46,307 observations. The STRATIFIED.TXT data file has a raw size of 133 Mb, and includes the same 367 variables for 53,693 observations. Many of the observations in both data files include significant missing data. To summarize these data sets, we provide a brief discussion of the sets of variables included in UNIFORM.TXT and STRATIFIED.TXT: • Individual Layout: This set of variables includes individual and household ID codes, individuals' names, and basic demographic variables for age and gender. Other demographic variables like ”Marital Status,” ”Employment Status,” “Occupation Type,” and “Ethnic Code” have significantly large number of missing values. • Household Layout: This set of variables includes data specific to a household (note that a household may include multiple individuals), and is a result of the merging of the Unilever Database with data from third-party sources. Third-party data offers information on household vehicle ownership, the distribution of genders and ages in the household, occupations, home ownership, credit card ownership, and membership in a number of lifestyle clusters (e.g. “traditionalist,” “home and garden”). Most of these fields have at most 20% missing data. • Demographic Model: This set of variables gives results of the Demographic logistic regression model for prediction of MVC. The most important field here is the “Model Score” which gives values between 0 and 1, the model’s prediction. The field “Model Score Group” denotes the decile in which the model scores falls. Deciles are defined according to the Demographic model results. 4 Massachusetts Institute of Technology • Golden Question Model: This set of variables gives results of the golden question logistic regression model for prediction of MVC. The most important field here is the “Model Score” which gives values between 0 and 1, the model’s predictions. The field ``Model Score Group'' groups the observations into deciles according to the Golden Question model results. 
The Demographic and Golden Questions models are positively correlated. We have measured a correlation of 0.6 on the UNIFORM.TXT data. • Block Group: This set of variables describes the block the household belongs to. A block is an address-based segment of the population. Thus, there are likely multiple households per block. These variables essentially provide demographic information about the geographical neighborhood of the household. Variables describe the urban/rural breakdown, the ethnic breakdown, the distribution of home valuation, employment breakdown, education level, etc. • Brand Usage Layout #01-#20: For each individual, there are 20 sets of brand usage variables. Thus each individual is associated with at most 20 brands. The brands are a mixture of Unilever and non-Unilever brands. A total of 259 brands appear in the UNIFORM.TXT dataset, with 96 chosen by at least 100 individuals, 55 chosen by at least 1000 individuals, and 17 chosen by at least 10000 individuals. The average individual reports interaction with 9.5 distinct brands. In our analysis we concentrated on the file UNIFORM.TXT instead of the file STRATIFIED.TXT, because it seemed appropriate to use a representative sample of the underlying data set. 2 PREDICTION EFFORTS ON ORIGINAL DATA SET Our initial efforts were towards developing methods for predicting usage of individual brands using demographic variables and cross-purchase information as inputs. We were particularly interested in methodological innovations for using these different sets of variables to make useful predictions. At this stage of the project we chose to concentrate on the prediction of usage for individual brands due to the following reasons: • Prediction of brand usage is of obvious use in targeted marketing • A lack of information on consumer profitability and limited time stamp information eliminated several of the topics mentioned in section 1.2 5 Massachusetts Institute of Technology • The problem is of general interest to Unilever and other packaged goods companies that have large amounts of data over a wide range of products. 2.1 Data Cleaning In efforts to clean the data set for analysis, we first performed a brand aggregation using the Franchise and Category information provided in the MASTER.XLS file. The reason for this was to eliminate the distinction between very similar products. For example, we judged that different flavors and versions of the same product should appear the same from the perspective of a company-level analysis. The new brand labels in most cases can be mapped oneto-one to the original notion of brands. After eliminating those brands with reported usage by fewer than 100 individuals, we were left with 76 unique brands for analysis in the UNIFORM.TXT data set. Figure 2 shows the brands with top ten consumer response and Figure 3 shows the distribution of reported usage among the 76 brands. Note that the patterns of reported usage in this data is not representative of the actual sales of these products. 
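To make the cleaning step just described concrete, the sketch below shows one way the consumer-by-brand usage matrix could be assembled: brands are aggregated to the franchise and category level using the MASTER.XLS mapping, and brands reported by fewer than 100 consumers are dropped. It is illustrative only; the file layout and column names (a long-format usage extract with consumer_id / brand_id rows, and brand_id / franchise / category fields) are hypothetical stand-ins, not the actual Unilever layouts.

```python
# Illustrative sketch only: file and column names are hypothetical stand-ins
# for the fields in UNIFORM.TXT and MASTER.XLS, not the actual layouts.
import pandas as pd

usage = pd.read_csv("usage_long.txt")    # assumed long format: one row per (consumer_id, brand_id) response
master = pd.read_excel("MASTER.XLS")     # assumed columns: brand_id, franchise, category

# Aggregate near-identical products to the franchise + category level.
usage = usage.merge(master[["brand_id", "franchise", "category"]], on="brand_id")
usage["brand"] = usage["franchise"] + " " + usage["category"]

# Build the 0/1 consumer-by-brand matrix and drop brands reported by fewer than 100 consumers.
matrix = (usage.assign(flag=1)
               .pivot_table(index="consumer_id", columns="brand",
                            values="flag", aggfunc="max", fill_value=0))
matrix = matrix.loc[:, matrix.sum() >= 100]
print(matrix.shape)   # roughly 46,307 consumers by 76 brands in the report's data
```

The same matrix, joined to the four demographic variables, is the input to the prediction methods described next.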
# Consumer Responses | Product
9367 | Suave Shampoo / Conditioner
6643 | Dove Bar Soap
6146 | Ragu Pasta Sauce
4989 | Lipton Tea Bags
4938 | Dial Bar Soap
4923 | Bath & Body Works Bar Soap
4640 | Lever 2000 Bar Soap
4554 | Good Humor / Breyers Ice Cream
4399 | Other Body Wash
4302 | Dove Body Wash
Figure 2: Brands with Top Ten Consumer Responses

[Bar chart: # Customers Using Product for each of the 76 brands]
Figure 3: Consumer Response Distribution for Overall Data

For each of these 76 products, we developed a binary indicator of reported usage for all consumers, with 1 representing a positive usage response and 0 representing no indication of usage. In addition to the 76 brands, we focused on the following four demographic variables in our prediction efforts:
• Presence of Child (Yes/No)
• Household Size
• Income Category
• Geographic Region (North, South, Midwest, West)
These variables were chosen because they included relatively few missing values and because they generally represent important individual and household indicators that act as proxies for underlying factors influencing purchase behavior. The resulting cleaned data thus contained four demographic variables and 76 binary reported usage variables for each of the 46,307 consumers. A representation of this data is as follows:

[Diagram: consumers 1 through 46,307 as rows; columns for the demographic variables (Child, Household Size, Income Category, Region) followed by the brand usage variables]
Figure 4: Representative Diagram of Unilever Data Structure

As is common practice in data mining studies, to deal with the so-called "over-fitting" problem the UNIFORM.TXT data was randomly partitioned into a training set of 27,731 consumers for fitting model parameters, a validation set of 9,291 consumers for choosing among models, and a test set of 9,285 consumers for measuring final results.

2.2 Predictive Methods Investigated

Logistic regression has a long history of use in targeted marketing for linking demographic variables and purchase behavior, while collaborative filtering is an approach that has developed with the rise of ecommerce for predicting a consumer's preferences based on the preferences of similar consumers. Our interests in this study were in investigating the relative merits of these two methodologies and of combining their results in various ways.

Logistic regression establishes a relationship between predictor variables and a response variable via the logistic function. In particular, we model a consumer's probability of response p in terms of a set of predictor variables x1, ..., xn as follows:

p = exp(∑i βi xi) / (1 + exp(∑i βi xi))

Logistic regression has often been used in marketing contexts, and has the advantages that the model is interpretable and can be fit using efficient methods. It can be susceptible to overfitting, however, when there are many predictor variables.

Collaborative filtering models an individual consumer's response as a weighted average of the responses of other consumers in the database, where the weighting is typically according to a similarity measure among consumers. The approach is appropriate in applications where there are sufficiently large numbers of products to allow computation of a useful similarity measure among consumers. The collaborative filtering approach has proved useful particularly in internet recommender systems.
Examples include the recommendation engines employed by Amazon and Netflix. Collaborative filtering is not as easily interpretable as logistic regression, but has the advantages that it is conceptually simple and is adept at handling many variables representing choices among a large number of possibilities. In our implementation, we compute similarities among consumers based on reported usage information only. Since this is binary data, we require a suitable similarity measure. We have made use of the socalled Jaccard similarity. Given usage vectors of two individual consumers, define a to be the number of products the two consumers have in common, b to be the number of products unique to consumer 1, and c to be the number of products unique to consumer 2. Then the Jaccard similarity is given by the ratio a/(a+b+c). The Jaccard measure thus takes into account products the two consumers have in common, but ignores items that neither have chosen. As our reported usage data is relatively sparse, there are likely to be many products chosen by neither. With the Jaccard similarity measure, these products will not inflate the similarity measure as they would with, say, a correlation measure. We also made use of a weighting scheme that weighs rare products more heavily than common ones. Such a modification is based on the observation that selection of a rare product is more informative than selection of a common product. Such an “inverse frequency” weighing is common in collaborative filtering systems. 8 Massachusetts Institute of Technology Initial tests of these methods indicated that we might be able to produce more accurate predictions by using the predictions of a logistic regression model and a collaborative filtering model as inputs to a third model. We considered three methods for combining these models: a weighted average of the two results, another logistic regression taking the two results as inputs, and an optimization model that computes a linear discriminant in the space of the logistic and collaborative model outputs. Upon further testing, we decided that the logistic regression method for combining models exhibits the best performance. In the end, we examined several models for predicting individual product usage from the demographic and other product usage data: • RAND : Predictions are random numbers between 0 and 1. This is less of a prediction method than a baseline for comparison. • LOGIT : A logistic regression model using demographic variables as predictors. • COLLAB : The collaborative filtering model using reported usage data from other consumers. • COMB_LOGIT : A logistic regression model using the LOGIT and COLLAB model predictions as predictor variables. • FULL_LOGIT : A logistic regression model using the demographic variables as well as the usage variables for products other than the one being modeled. We implemented two versions of the FULL_LOGIT model: one using all variables available, and one which uses subset selection methods to choose an accurate model with only 5 variables. 2.3 Results We present our results for the various methods in the form of lift curves. To compute a lift curve for a given product, we use a method described above to assign each consumer in the validation set an estimated probability of usage. We then rank the consumers in order of their predicted usage, and ask the question “If we contacted the top M consumers in this list, how many actual users would we find?” If we repeat this question for every choice of M, we obtain the lift curve. 
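The following sketch illustrates, under simplifying assumptions, the core of the COLLAB predictor and the lift-curve computation just described. The function and variable names are ours, not the report's code; the inverse-frequency weighting of rare products is omitted for brevity, and the LOGIT model would be an ordinary logistic regression on the demographic variables.

```python
# Illustrative sketch of the Jaccard-based collaborative filter (COLLAB) and
# the lift-curve computation described above; a simplified reading of the
# report's method, not its implementation.
import numpy as np

def jaccard(u, v):
    """Jaccard similarity a / (a + b + c) for two 0/1 usage vectors."""
    a = np.sum((u == 1) & (v == 1))   # products reported by both consumers
    bc = np.sum(u != v)               # products unique to one of the two
    return a / (a + bc) if (a + bc) > 0 else 0.0

def collab_score(train, target_col, new_user):
    """Weighted average of other consumers' usage of the target brand,
    weighted by Jaccard similarity computed on the remaining brands."""
    others = np.delete(train, target_col, axis=1)
    probe = np.delete(new_user, target_col)
    weights = np.array([jaccard(probe, row) for row in others])
    if weights.sum() == 0:
        return train[:, target_col].mean()
    return np.average(train[:, target_col], weights=weights)

def lift_counts(scores, actual):
    """For each M, the number of actual users found among the top-M ranked consumers."""
    order = np.argsort(-scores)
    return np.cumsum(actual[order])
```

Plotting lift_counts against M for each method (and against the count expected from random ordering) produces the lift curves discussed below.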
Random predictions should roughly give a straight diagonal line, while effective prediction algorithms will result in lift curves as high as possible above this diagonal line. Higher lifts indicate better performance of a predictive model.

Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively

Figure 5 includes lift curves for the models used to predict usage of the products Caress Body Wash and Classico Pasta Sauce; these are indicative examples of results obtained for several products. The first set of lift curves illustrates the results for the COLLAB, LOGIT, and COMB_LOGIT methods. We observe that while the LOGIT model using the four demographic variables does better than random prediction, the COLLAB model using usage information of other products does significantly better. Combining the two models does only slightly better than the collaborative filtering approach alone. The following set of lift curves adds the FULL_LOGIT method:

Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively

The FULL_LOGIT model based on all the variables performs even better than the COMB_LOGIT model. The FULL_LOGIT variant using only a few selected variables also exhibits surprisingly good performance. This observation motivated a closer look at the specific coefficients of the variables in these parsimonious models. The tables below give the coefficients for the two 5-variable models responsible for the lift curves above:

Target: Caress Body Wash (FULL_LOGIT coefficients)
Intercept: -2.47
Caress Bar Soap: 1.61
Lever Body Wash: 0.59
Oil Olay Body Wash: 0.51
Herbal Essence Body Wash: 0.46
Dove Body Wash: 0.45
Figure 7: Coefficients of a Logistic Regression Model for Caress Body Wash

Target: Classico Pasta Sauce (FULL_LOGIT coefficients)
Intercept: -2.89
Five Bros Pasta Sauce: 1.92
Francisco Rinaldi Pasta Sauce: 1.31
Prego Pasta Sauce: 0.52
Breyer's Ice Cream: -2.11
Lipton Tea Bags: -2.53
Figure 8: Coefficients of a Logistic Regression Model for Classico Pasta Sauce

Thus, among the most useful variables for predicting Classico Pasta Sauce usage are other pasta sauce brands, while other brands of body wash are among the most useful information for predicting usage of Caress Body Wash in the overall data set.

2.4 Conclusions

Our analysis of the overall UNIFORM.TXT data set led us to some intriguing conclusions, and motivated a closer look at the data set and its sources. Methodologically, while combination models are interesting, logistic regression, perhaps with subset selection, is a sufficiently powerful method for analyzing this data. Furthermore, it has the added advantage of interpretability and is well understood as a tool in marketing.

Our primary finding was that in this data set, reported brand usage variables are considerably more powerful predictors than the limited set of demographics we looked at. In particular, usage of a given product can be predicted surprisingly accurately using usage data from a small number of closely related products. These pronounced and powerful trends encouraged us to take a closer look at the underlying data. After discussions with our counterparts at Unilever, it was revealed that much of the data we were working with was aggregated from two consumer surveys. Indeed, one of the surveys asked questions regarding personal washes, while the other did not.
Also, one of the surveys included questions regarding pasta sauces, while the other did not. Clearly, the differences between the two surveys largely explained the high correlation we were observing among usages of similar products. Thus, our most significant conclusion from this analysis was that our models were achieving impressive results, but were likely modeling the data collection technique rather than the underlying phenomenon. Unfortunately, such a model may generalize poorly to panel data or to real-world situations. Such survey data, in an aggregated format, may serve as a poor proxy for purchase behavior.

The results of this analysis motivated more work to understand the source of the data. Subsequent efforts were focused on identifying portions of the data that were as uniform as possible, and forming predictions and clusterings using relatively simple modeling techniques such as logistic regression.

2.5 Block Structure in the Unilever Data Extract

The previous analysis motivated a closer look at the data. At this point Unilever provided us copies of two questionnaires, the responses to which comprised a large portion of the data in the Unilever database extract. Using timestamps that indicated the dates of collection for the various responses, we were able to obtain a more detailed understanding of the data. This structure is indicated in the following figure:

[Diagram: rows are Consumer 1 through Consumer 46,307; columns are the demographic variables, the DM and GQM model scores, and usage variables grouped into Spring 2000 survey, Fall 2000 survey, and Non-Survey / "coupon" response blocks (the Non-Survey block covers 70 brands; the survey blocks are labeled 39 brands and 11 Unilever brands)]
Figure 9: Graphical Representation of Unilever Data Structure

In Figure 9, rows indicate data available for a single consumer, while columns indicate different variables in the data extract. Some demographics and model scores are reported for each consumer. In the section marked as "Usage Variables," the shaded blocks indicate the presence of self-reported usage data. After some investigation, we believe that roughly half the consumers had Spring 2000 survey responses and no Fall 2000 survey responses, while the other half had Fall 2000 responses and no Spring 2000 responses. A subset of consumers from both groups also had some additional brand usage responses, which we hereafter refer to as "Non-Survey" data. We were advised that this Non-Survey data largely represented responses to coupon redemption.

In subsequent analyses, we concentrated on sets of brands and consumers whose usage data fell uniformly in one of these blocks. The individual surveys reported on a relatively small set of brands, while the Non-Survey data included a much larger set of brands. For this reason, we focused our efforts on the Non-Survey data. In what follows, we will briefly discuss our limited modeling efforts on the Fall 2000 survey data, and we will provide a lengthy discussion of an extensive analysis of the Non-Survey data.

2.6 Analysis of Fall 2000 Survey Data

While our efforts were focused on the Non-Survey data, we also performed an exploratory analysis of the Fall 2000 survey data and associated demographic variables. We extracted a small sample of Fall 2000 data that included 2,500 consumers. The Fall 2000 survey data includes response data on a limited number of brands. The Unilever brands represented are Suave, Lipton, Breyer's, Wishbone, Dove, Lever2000, Caress, and Snuggle.
This limited amount of response information constrained the scope of analysis we could perform, and hence we tried the following three indicative tasks:
• Predicting reported Caress usage given all other available variables.
• Predicting reported Caress usage given demographic variables only.
• Predicting simultaneous Caress and Snuggle usage given all other available variables.
The third task was an attempt to use predictive modeling to identify cross-selling opportunities. We used only the demographic variables described above: region of residence, income level, household size, and presence of children. We tried several modeling methodologies including nearest neighbor methods, logistic regression, discriminant analysis, classification trees, and neural networks. These methodologies will be described in more detail in a subsequent section. We concluded that the choice of modeling algorithm made an insignificant difference in the quality of the results obtained. The best-performing models predicted 100% non-usage, which gave a 32% misclassification rate for Caress and a 19% misclassification rate for Caress / Snuggle cross-sells. These results indicated that it is difficult to make predictions given the limited number of variables.

[Lift chart: # responses vs. # targeted, showing the Caress lift curve (demographic and response predictors), the Caress lift curve (demographic predictors only), the Caress & Snuggle lift curve, and the corresponding reference lines]
Figure 10: Lift Curves for Logistic Regression Models Based on Fall 2000 Survey Data

On constraining the models to generate a reasonable number of usage predictions, the best models achieved a 35% misclassification rate for Caress and a 21% misclassification rate for Caress / Snuggle cross-sells. Figure 10 shows lift curves from the logistic regression models. We observe that the model making use of demographic variables only gave insignificant lift, while the models based on both demographic and brand usage information achieved more lift. The most useful predictor variables seemed to be other brands of soap, namely Dove and Lever2000. While this may reflect real consumer usage patterns, it may also be due to the design of the survey, which included separate sections for soap and for other products. As with the other analyses in this report, the question remains as to whether these results are transferable to real-world usage patterns.

3 NON–SURVEY DATA ANALYSIS

Here we report on a subset of the original UNIFORM.TXT which we refer to as the "Non–Survey" data. Our decision to analyze this section of the data was guided by the internally homogeneous composition of this data set and the fact that it includes information on a wide range of products. The discussion proceeds as follows:
• Data Details: A detailed description of the Non–Survey data extraction and composition, highlighting some of its inherent features.
• Data cleaning and aggregation: A detailed description of the treatment of missing values and outliers and the transformation of some of the variables.
• Data credibility issues: A few of our observations suggesting the possibility of artificial data structure and data bias issues.
• Predictive modeling: An in-depth analysis to predict brand usage and explore the Most Valuable Consumer concept using various modeling techniques based on both demographics and Golden Question Model and Demographic Model scores.
• Cluster analysis: A description of efforts to segregate consumers into distinctive clusters and to use this information to enhance predictive efforts.

3.1 Data Details

The layout of the data provided by Unilever was explained above. The Non–Survey data has been extracted from the file UNIFORM.TXT, which contains information on 46,307 consumers. Each usage entry in UNIFORM.TXT bears a date stamp indicating the time of data collection. A majority of the data entries in this file bear one of two time stamps, namely 15th May, 2000 and 15th November, 2000. Based on the quantity of usage data associated with these two dates and the fact that the data with these time stamps seems to correspond to the surveys provided to us, we assumed that usage entries with these time stamps correspond to the Spring and Fall survey data. The remaining data is what we analyze in this section; we refer to it as the Non–Survey data. As per information provided by Unilever, we believe the Non-Survey data represents product promotion coupon responses.

The Non–Survey data consists of 14,492 consumers. For each consumer we have demographic information and reported usage for seventy brands, some of which are non–Unilever brands. In addition, Golden Question Model and Demographic Model scores have been provided for each consumer. The diagram below is a graphical representation of the data layout and structure.

[Diagram: the consumer-by-variable layout of Figure 9, with demographic variables, GQM and DM scores, and the Spring 2000 survey, Fall 2000 survey, and Non-Survey / "coupon" usage blocks; the Non-Survey block of 70 brands is the portion analyzed in this section]
Figure 11: Graphical Representation of Non–Survey Data

Each consumer reports usage of at most twenty brands. Following the data cleaning efforts, the maximum number of brands reported by a consumer was reduced to seventeen. The following diagram shows the distribution of consumers according to the number of responses reported by each consumer. It is observed that a large majority of the consumers report usage of very few brands, which leads to the sparse nature of the Non–Survey data set.

[Bar chart: # Consumers against # Responses (1 to 17)]
Figure 12: Consumer Response Distribution for Non–Survey Data

Figure 13 depicts the category–wise brand distribution of Unilever and non–Unilever brands in the Non–Survey data set. There is a dominating presence of body wash and bar soap brands in the data set, followed by food items. This is because of the survey design and, therefore, the distribution is not representative of true usage patterns, a potential bias that we will investigate later in the discussion.

[Bar chart: # Brands per brand category (Bar Soap, Body Wash, Shampoo, Detergents, Food Items, Body Items, Misc)]
Figure 13: Category–wise Brand Distribution

3.2 Data Processing and Transformation

In contrast to our initial efforts, during the analysis of the Non–Survey data we sought to incorporate a wide range of demographic and model score variables for a more comprehensive study. This necessitated numerous decisions regarding data set preparation, such as the choice of demographics and the treatment of missing values and outliers.

3.2.1 Data Cleaning

Following the data cleaning efforts mentioned in section 2.1, the Non-Survey data set was further cleaned and filtered.
The data set initially contained 14,492 consumers, which was reduced to 8,608 consumers after data cleaning and filtering. As stated earlier, the Non-Survey data had over 100 demographic variables and 70 brand variables. All of the brand variables have been considered in the analysis regardless of affiliation with Unilever. Block Layout variables were not considered in this analysis due to their complex nature and to expedite the process; we believed the information contained in the Household and Individual variables was significant enough to reveal the relevant patterns.

Among the Individual Layout and Household Layout demographics, variables with more than 20% missing values were eliminated, since imputing missing values for such a large number of data entries would have led to misleading results. Some of the demographic variables were rejected due to the ambiguous nature of their source and method of computation. Certain variables that seemed to be derived from other variables in unclear ways were also rejected; examples include lifestyle clusters like traditionalist, home and garden, etc. For important demographic variables, all consumers with missing values were eliminated. For the remaining demographics measured on a continuous scale, missing values were imputed with the mean values. Outlying consumers whose demographic values were more than five standard deviations from the mean were also eliminated. This left a negligible number of consumers with missing values among demographics measured on a binary scale; these were also eliminated.

3.2.2 Data Aggregation

Certain demographic variables were aggregated for the purpose of obtaining simpler, more interpretable models, decreasing the computational effort, and dealing with potential model over-fitting.

State-Wise Regional Aggregation

The variable FIPS_census_state contains information on the state of residence of the consumer. The states were aggregated into the following nine regions:
• New England
• Middle Atlantic
• East North Central
• West North Central
• South Atlantic
• East South Central
• West South Central
• Mountain
• Pacific
This aggregation was necessary to decrease computational and time complexity. In addition, we believed that a region-wise approach would be more insightful.

Household Member Age-Gender-Wise Aggregation

Variables containing information on the age- and gender-wise presence of household members were aggregated. These are variables of the type IB_males_0_2, IB_females_3_5, etc.

Presence of Male 0-2, 3-5, 6-10, 11-15, 16-17 years -> Presence of Male Child (Non-Earning)
Presence of Male 18-24, 25-34, 35-44, 45-54 years -> Presence of Male Adult (Earning)
Presence of Male 55-64, 65-74, 75 plus years -> Presence of Male Senior
Figure 14: Household Member Age-Gender-Wise Aggregation

The variables were grouped into presence of a child, adult, or senior member of male, female, or unknown gender. The age thresholds for segregation were chosen based on certain assumptions about the occupational status of each household member according to the individual's age, as made clearer in Figure 14. The variables were combined into children, adult, and senior categories as indicated above. The same treatment was extended to the female and unknown-gender variables as well.
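As a rough illustration of the aggregation just described, the sketch below collapses hypothetical IB_(gender)_(age band) indicator columns into presence-of-child/adult/senior flags per gender. The exact column names and band lists are assumptions patterned on the layout descriptions, not the project's actual code.

```python
# Illustrative sketch of the age-gender aggregation; IB_* column names and
# the age-band lists are assumptions, not the actual Unilever layout.
import pandas as pd

AGE_BANDS = {"child":  ["0_2", "3_5", "6_10", "11_15", "16_17"],
             "adult":  ["18_24", "25_34", "35_44", "45_54"],
             "senior": ["55_64", "65_74", "75_plus"]}

def aggregate_household_members(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the 36 IB_<gender>_<ageband> indicators into 9
    presence-of-child/adult/senior flags (one per gender group)."""
    out = df.copy()
    for gender in ["males", "females", "unknown"]:
        for group, bands in AGE_BANDS.items():
            cols = [f"IB_{gender}_{band}" for band in bands]
            cols = [c for c in cols if c in out.columns]
            if not cols:
                continue
            out[f"presence_{gender}_{group}"] = out[cols].max(axis=1)
            out = out.drop(columns=cols)
    return out
```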
This grouping compressed 36 variables into 9, while preserving the age- and gender-wise composition of the household and its influence on consumer response. Prior to our meeting with Unilever representatives in July 2002, we assumed that household members in the age group of 16 to 17 could be included in the adult category. However, we were informed that Unilever considers consumers up to the age of 17 as children. We made the necessary modifications to the data set thereafter, but no significant changes resulted from this minor modification.

3.3 Data Issues

Prior to extensive predictive modeling efforts, the data was examined to extract information that might be useful in the subsequent study. We observed several instances that suggested data inconsistencies and possible biases in the data. Some of these issues relate to the means and methods of data collection and interpretation, while others relate to possible artificial data structure induced by the aggregation of dissimilar data from multiple sources. Presented below are a few cases in point.

3.3.1 Conflicts in Computation of Household Layout Variables

The Household Layout variables include certain variables that supply gender-wise and age-wise information on the presence of household members. Examples include IB_males_0_2, IB_females_3_5, etc. (henceforth referred to as household-member variables). A positive response indicates the presence of a household member in that age and sex group. This information was not extracted from the consumer directly, but rather derived from a third-party data source, and its accuracy varied depending on the source of the data. The following situations led us to doubt the accuracy of some of the information contained in these variables:
• The data set also contains a variable called IB_house_size. The sum of positive responses across the aforementioned household-member variables should not exceed the value indicated by IB_house_size. Yet we observed no correlation between the value aggregated from the household-member variables and the house size variable. We tried various combinations, with the inclusion or exclusion of unknown-gender members, yet we failed to reconcile the two sets of variables.
• Similarly, there was no reconciliation between the values represented by the variables IB_presence_of_child or IB_number_of_adults and the information aggregated using various combinations of the household-member variables.
The above data inconsistencies suggest that caution must be exercised in choosing variables to be considered in the modeling analysis, and that variable definitions and calculations must be analyzed carefully. To capture information on the age and gender of household members, we chose to use the aggregated household-member variables.

3.3.2 Unbalanced Brand Representation

Consider the following distribution of consumer responses for some of the brands in the Non–Survey data.

[Bar chart: # Responses per brand for Suave BarSoap, Mealmk StirFry, lvr2k Bodywash, Ragu PastaSauce, Caress BarSoap, Dove BodyWash, and Dove BarSoap]
Figure 15: Consumer Response Distribution for Some of the Brands

Brands contributed disproportionately to the Unilever database. For instance, there is an overwhelming presence of Dove body products, whereas the response for some other products is relatively infrequent.
The most frequently reported usage is of bar soap and body wash products, followed by food items, detergents, and other miscellaneous products, in that order. It seems evident that this pattern is not representative of true brand usage frequencies.

3.3.3 Insufficient Information on Source of Information in Non–Survey Data

The imbalance in reported usage across various brands raised doubts about the origins of the Non–Survey data. Details on the source of the data and the methods of data collection were not disclosed clearly. We were advised to assume that the Non–Survey data represents information on the redemption of coupon promotions; however, we did not have details about the coupons themselves or their methods of circulation. We note that the applicability of our data analysis results largely depends on the data quality and the extent to which it is understood. Though we believe that our results accurately reflect the process that generated the data, they are only as valuable as the extent to which a real brand usage situation has been captured in the data presented to us.

3.4 Predictive Modeling

For the reasons listed before, our prediction efforts have been focused on predicting reported usage of individual brands. We also briefly looked at the modeling of MVC. The following summarizes our sequence of predictive analysis:
• Fitting various algorithms to the data set to identify a common predictive modeling methodology leading to the best results over a wide range of brands.
• Examining the contribution of GQM and DM scores in predictive models.
• Conducting a "Most Valuable Consumer" (MVC) based analysis to capture the information contained in GQM and DM scores for the prediction of MVC.
• Drawing inferences and conclusions from model results and suggesting means for obtaining improved results.

3.4.1 Choice of Models

We fitted a number of naïve and sophisticated models to arrive at a common predictive model that proved both accurate and interpretable for a wide range of brands. Some of the modeling techniques tried were: k-Nearest Neighbors, Classification Trees, Artificial Neural Networks, and Logistic Regression. The choice of an appropriate model involves a tradeoff between the accuracy of results, interpretability, ease of application to alternate data sets, and computational complexity. The models rank as follows in decreasing order of interpretability: Regression Models, Classification Trees, k-Nearest Neighbors, and Neural Networks. Following is a brief description of some of these algorithms, focusing on the advantages and disadvantages of each.

Artificial Neural Network (ANN): An artificial neural network is a mathematical model capable of robust classification even when the underlying data structure is quite complex. ANNs derive their predictive power from an architecture of interconnected computational units. They also allow predictive modeling of more than one variable in a single iteration. Although capable of high accuracy, ANNs suffer from the drawback that their models can be hard to interpret; thus they are more appropriate when predictive accuracy is more important than interpretability. They also require very large training data sets, are susceptible to over-training, and are more computationally intensive than other algorithms.

Classification Trees: A classification tree makes data classifications based on a set of simple rules that can be organized in the form of a decision tree.
While these models are not as intricate as ANNs, they are considerably more interpretable. The decision trees can be easily translated into business strategies, so they have recently become more popular in business contexts. This technique is also applicable to data sets with missing values, thereby considerably reducing the data cleaning effort.

k-Nearest Neighbors: This methodology is in many ways similar to the collaborative filtering methodology discussed earlier. Usage predictions for a given variable are computed as a weighted average of the usage of other consumers. Its main advantage lies in the fact that it is effective when there are a large number of response variables. However, this simplicity may come at the expense of predictive accuracy.

Logistic Regression: Logistic regression was discussed earlier; it has long been used to model preferences in the marketing context. Some of its benefits include a record of success over a wide range of prediction problems, the interpretability of the models, and the speed of the algorithms used for model fitting. The output of a logistic regression model is a set of posterior probabilities, for which we can vary the threshold according to the desired level of predictive aggressiveness.

[Lift curves comparing models for Dove Body Wash]
Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network

On fitting all of the aforementioned algorithms to numerous brands and comparing the results, logistic regression was chosen as the common model for all subsequent predictive analysis. We observed that the predictive results from logistic regression were at least as good as or better than the results obtained using other models for a wide range of products. Figure 16 illustrates the superior predictive performance of logistic regression compared to classification trees and k-nearest neighbors (which appears as "User" in the figure) in the case of Dove body wash.

3.4.2 Predictive Efforts using Logistic Regression

An exhaustive logistic regression analysis was conducted for all seventy brands in the data set, and a number of interesting results were observed.

Varying Success in Predictive Efforts

Logistic regression models were built to make predictions of each brand's usage based on the demographics and model scores only. We observed varying degrees of success across brands, ranging from highly successful results, as for Breyer's Ice Cream, to poor results, as for Lipton Tea Bags. Below we present lift curves depicting typically good, moderate, and poor results.

[Lift curves, three panels: Good: Breyers Ice Cream; Medium: Caress Body Wash; Poor: Lipton Tea Bags]
Figure 17: Lift Curves for Different Unilever Brands

Some of the conclusions to be made are as follows:
a) For a majority of brands, noticeable lift was observed. Typical lift curves over a wide range of products are similar to the Caress body wash lift curve shown above. This indicates that the demographic and model score variables contain considerable predictive power in this data set and can be used for making brand usage predictions.
b) The most important predictive variables emerged to be the Golden Question Model (GQM) and Demographic Model (DM) scores. There are two possible explanations for this. The first is that MVC may be an important summary statistic that captures brand usage. The second is that the Golden Question Model takes a number of brand usage variables as inputs; thus using the GQM scores to further predict the same usage variables can lead to artificially inflated results that may be misleading.
c) The following demographics were seen to have a significant presence in models for a wide range of brands:

Significant Demographics: Age, Length of Residence, Gender, Marital Status, Household Members, Region Code
Figure 18: Significant Demographic Parameters

d) Brands with high coefficients in the GQM computation had significantly better lift curves. As already noted, this may be deceptive; hence the lift charts for the brands which have been used as inputs in the computation of the GQM must be viewed with caution. Breyer's Ice Cream is an example of such a brand.
e) Brands for which the response rate was below 10% of the total consumer base led to poor predictive results, as in the case of Lipton Tea Bags. This can be attributed to the insufficient number of data entries available for training and validating the predictive models.

Example of a Predictive Model

The following chart is an example of a logistic regression predictive model for one of the Unilever brands, namely Gorton Fillets. The diagram graphically indicates t-scores for the model.

Figure 19: Graphical Representation of Logistic Regression Model for Gorton Fillets

The most important model coefficients represented above are as follows:

Target: Breyers Ice Cream (logistic regression model)
Model Score GQM: 4.7005
Model Score DM: -2.521
Intercept: -3.0433
Absence of Female Adult: 0.2977
Gender: -0.358
Age: 0.0183
Absence of Unknown Adult: 0.400
Home Renter: 0.215
Figure 20: Logistic Regression Model Coefficients for Gorton Fillets

During the course of the study, it became evident that the GQM and DM scores held significant predictive information; for a majority of products, the model coefficients were highest for these score variables. As explained earlier, the GQM and DM scores were fit using panel data. Their importance in models for the same brands on a different data set indicates a degree of similarity between the panel data and the Non–Survey data, as well as establishing the importance of GQM and DM scores in predictive efforts.

Modeling With and Without GQM Scores

Further analysis was carried out to judge the contribution of the GQM and DM score variables in predictive models. We wished to explore the comparative performance of models without the GQM and DM scores, based on demographics only. The figure below indicates the superior performance of the models based on both demographics and model scores compared to models based on demographics only. (In Figure 21, "Reg" represents the model excluding GQM scores and "Reg 2" represents the model based on demographics only, for the prediction of Breyer's Ice Cream usage.)

Figure 21: Comparative Lift Curves for Models With and Without GQM Scores

3.5 GQM Score-Based Stratified Analysis

As documented previously, Unilever spent considerable effort in identifying the "Most Valuable Consumers" (MVC) on a brand- and company-wide level. Given the Golden Question Model scores and Demographic Model scores for each consumer, we were keen to enhance our predictive efforts using this information. The exhaustive logistic regression analysis had already convinced us that GQM and DM scores could be highly instrumental in predicting brand usage. A two-pronged approach was followed in the MVC-based analysis.

3.5.1 Stratified Prediction Models

Previous modeling attempts focused on fitting a single model for each brand to the entire data set and generating posterior probabilities using these models.
Since the GQM scores seemed to contain information on MVC, we stratified the data set into three categories based on the model scores. First, a separate training data set was created. The partition was based on the consumer distribution such that each stratum contained roughly one-third of the consumers. All consumers with a GQM score greater than 0.679 were categorized as high GQM consumers, all consumers with a GQM score between 0.342 and 0.679 were categorized as medium GQM consumers, and the rest were categorized as low GQM consumers. Separate logistic regression models were built for each of these three training data sets. Based on the strata cutoffs for the training set, the validation data set was also divided into three corresponding data sets. The models fit on the respective training data sets were applied to the validation sets to compute posterior probabilities for each consumer in all the categories. Thus we generated two sets of posterior probabilities for each consumer: one from the overall model and one from the stratified analysis. Lift charts were drawn for both predictive efforts to compare performance.

[Lift chart comparing the GQM-stratified analysis, the overall analysis, and the baseline]
Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models

Figure 22 shows lift charts for the GQM score-based stratified and overall models for Dove bar soap, which are representative of other brands as well. It is clear that both methodologies lead to similar results. We noted that there was an insignificant difference among the three separate models created for the stratified data sets, and that these models were each close to the overall model as well. This claim was further substantiated by the details of the model coefficients in each case: the important variables in the overall model and the stratified models were the same for a given brand, with only a slight variation in the coefficients of each of these variables. Thus we concluded that stratification of the data set according to GQM scores does not lead to improved results.

3.5.2 Predicting GQM Strata for Each Consumer

We have discussed the reasons that prevented us from thoroughly investigating the concept of MVC given the nature and type of data available to us. We found no direct indications of MVC in the Unilever data set, only GQM- and DM-based estimates of MVC. Nevertheless, we spent some effort using GQM scores as a target for our predictive models. Instead of predicting the exact GQM scores, we tried to predict, based on demographics, whether the consumer belonged to the high, medium, or low GQM stratum. Our intention was to generate a representative MVC model using the Non–Survey data based on demographics only. Logistic regression was used to arrive at results. The GQM strata definitions were kept the same as above. Each consumer was assigned a stratum number of one, two, or three depending on whether the consumer belonged to the high, medium, or low GQM stratum. Subsequently, predictive models were computed to predict the stratum class for each consumer based on demographic variables only. A very low degree of success was achieved in predicting the GQM score stratum to which each consumer belonged. For this reason, and because of the unclear applicability of this model, we did not find it appropriate to pursue this line of analysis further.
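For concreteness, the following sketch shows one way the stratified modeling of Section 3.5.1 could be reproduced. It assumes hypothetical pandas DataFrames train and valid with a gqm_score column, a list of demographic feature columns, and a binary brand usage target; it is an illustration of the approach, not the models actually fit in this study.

```python
# Illustrative sketch of GQM score-based stratified prediction; the DataFrame
# and column names are assumptions, not the report's actual data layout.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def gqm_stratum(score: float) -> str:
    # Cutoffs from the report: > 0.679 high, 0.342 to 0.679 medium, else low.
    if score > 0.679:
        return "high"
    return "medium" if score >= 0.342 else "low"

def fit_stratified(train: pd.DataFrame, valid: pd.DataFrame,
                   features: list, target: str) -> pd.DataFrame:
    """Fit one logistic model per GQM stratum and score the validation set."""
    valid = valid.copy()
    valid["p_stratified"] = 0.0
    for name, part in train.groupby(train["gqm_score"].map(gqm_stratum)):
        model = LogisticRegression(max_iter=1000).fit(part[features], part[target])
        mask = valid["gqm_score"].map(gqm_stratum) == name
        if mask.any():
            valid.loc[mask, "p_stratified"] = model.predict_proba(
                valid.loc[mask, features])[:, 1]
    return valid
```

Comparing the lift of p_stratified against the posterior probabilities from a single overall model is exactly the comparison summarized in Figure 22, which in our analysis showed no material improvement from stratification.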
4 CLUSTERING ANALYSIS

4.1 Clustering Background

Clustering places objects into groups, or clusters, suggested by the data. The objects in each cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. The observations are divided into clusters so that every observation belongs to at most one cluster. Clustering not only reveals inherent data characteristics by identifying points of similarity or dissimilarity in the data set; it also aids in understanding data structure issues. If dissimilar data sets have been aggregated to produce a bigger data set, clustering of the aggregated set might reveal the underlying data sets. An added advantage of clustering analysis is that it can be applied to a data set with missing values. Aside from data cleaning and data structure issues, clustering results can also be of interest in themselves, by identifying groups of consumers with similar traits who may be targeted in a similar fashion. Additionally, we explored the possibility of improved prediction by modeling the individual clusters separately and then aggregating the results.

Clustering can be performed according to various methods. For our analysis we chose Ward's method, which is somewhat more sophisticated than the popular but simple k-means method. Ward's method is an iterative method that seeks to minimize the statistical spread of observations within a cluster. In this method, the distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables. At each iteration, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous iteration. Clustering can also be performed according to various measures of spread or distance among the data points; during our study we used the least-squares measure because it is the fastest for large data sets.

During the clustering analysis, SAS Enterprise Miner computes an importance value between 0 and 1 for each variable in the data set. This represents a measure of the worth of the given variable in the formation of the clusters. As the data is split into clusters, the importance value of each variable indicates the extent to which that variable was influential in the splitting process. An importance value of 0 indicates that the variable was not used as a splitting criterion for clustering, and an importance value of 1 indicates that the variable had the highest worth as a splitting criterion.

One of the most important tools for interpreting individual clusters is the input mean chart for each cluster. This allows a comparison of the variable means for selected clusters to the overall variable means. The input means are normalized using the scale transformation

y = (x − min(x)) / (max(x) − min(x))

For example, assume five input variables y_1, ..., y_5 and three clusters C_1, C_2, C_3. Let the input mean of variable y_i in cluster C_j be represented by M_ij. Then the normalized mean, or input mean, SM_ij becomes

SM_ij = (M_ij − min(M_i1, M_i2, M_i3)) / (max(M_i1, M_i2, M_i3) − min(M_i1, M_i2, M_i3))

The input means are thus normalized to fall in a range from 0 to 1. For each cluster, the input means are ranked based on the magnitude of the difference between the input means for the selected cluster(s) and the overall input means. The variables with the highest spreads typically best characterize the selected cluster(s). Input means that are very close to the overall means are not very helpful in describing the unique attributes of consumers within the selected cluster(s).
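The normalization above is straightforward to reproduce. The sketch below computes per-cluster means and rescales them across clusters, using a small hypothetical 0/1 brand-usage table and cluster labels rather than the actual Unilever extract.

```python
# Illustrative sketch of the normalized input-mean calculation (SM_ij) described above.
# `usage` (0/1 brand indicators) and `cluster` (cluster labels) are hypothetical toy data.
import pandas as pd

def normalized_input_means(usage: pd.DataFrame, cluster: pd.Series) -> pd.DataFrame:
    """Per-cluster mean of each variable (rows = clusters, columns = variables),
    rescaled to [0, 1] across clusters for each variable."""
    m = usage.groupby(cluster).mean()          # M_ij
    lo, hi = m.min(axis=0), m.max(axis=0)      # min / max over clusters, per variable
    return (m - lo) / (hi - lo)

usage = pd.DataFrame({"dove_body_wash":   [1, 1, 0, 0, 1, 0],
                      "ragu_pasta_sauce": [0, 0, 1, 1, 0, 1]})
cluster = pd.Series([1, 1, 2, 2, 3, 3], name="cluster")
print(normalized_input_means(usage, cluster))
```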
4.2 Cluster Analysis Details

We performed a clustering analysis to identify distinctive groups of consumers based on their brand usage responses only. All seventy brands were used as input variables for the clustering analysis. The clustering analysis led to a few noteworthy insights and enhanced our understanding of the data. The analysis was performed over the entire Non–Survey data set. We repeated the clustering several times for various randomly sampled subsets of the Non–Survey data to ensure the generality and accuracy of the results. The repeated runs of the clustering algorithm over these subsets gave similar results and seemed to indicate an obvious clustering into three groups. The results shown below are for the clustering of the entire Non–Survey data set.

Figure 23: Cluster Pie Chart

Figure 23 depicts the three clusters obtained as distinct pie sections, whose sizes indicate the frequency of data points in each cluster. The color of each pie section reflects the root mean square standard deviation of the points in that cluster; these values are provided in Figure 24.

Cluster 1: 38% of data points; standard deviation 0.246; frequent buyers of body soaps/washes, infrequent buyers of food items
Cluster 2: 8.7% of data points; standard deviation 0.269; frequent buyers of detergents, buyers of items with low response rate
Cluster 3: 53.3% of data points; standard deviation 0.194; frequent buyers of food items, infrequent buyers of soaps/washes
Figure 24: Cluster Statistics

Figure 24 shows some statistical details of each cluster. These include:
• Percentage of data points – the percentage of data points in each cluster (total number of data points: 8,608).
• Cluster standard deviation – the average standard deviation of each point in the cluster from the cluster mean, which indicates the within-cluster spread of the data.
• Cluster description – a qualitative description of the distinguishing features of each cluster.

The cluster descriptions are based on the input means plots and importance values described previously. Another tool used in the process was a decision tree created during the clustering analysis. It is similar to a classification tree: a set of simple variable-based rules is generated that approximates the cluster boundaries (a sketch of such a rule-based approximation is given after the cluster descriptions below). The following section describes how the crucial features distinguishing the clusters were identified and discusses the similarity traits within each cluster.

4.3 Inference from Cluster Analysis

4.3.1 Effects of Unbalanced Brand Representation

Figure 25 depicts importance values for the most critical variables on which the clustering of the data points was based.

[Bar chart of importance values, ranging from 0 to 1, for the most significant brands: Suave fabric cond, GHB ice cream, Ragu pasta sauce, Dove body wash, Surf heavy duty lq, All laundry powder, Caress body wash, Oofo bar soap, Dial bar soap, Pond's face, Oofo body wash, Caress bar soap, and Dove body wash.]
Figure 25: Importance Values of Significant Variables in Clustering

We find that the personal wash products are the most important variables for the clustering analysis. We also noticed that the frequency of brand usage affects the importance value: typically, all brands with low reported usage have a low importance value. Thus we note that the importance values are related to the unbalanced brand representation in the data set.
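For concreteness, the kind of clustering run described in Section 4.2 can be approximated as follows. The actual analysis used SAS Enterprise Miner, so this scipy-based sketch, with a hypothetical input file of 0/1 brand-usage indicators, only illustrates the general procedure.

```python
# Minimal sketch of Ward's-method clustering on 0/1 brand-usage indicators.
# The input file is a hypothetical stand-in; the project used SAS Enterprise Miner.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

usage = pd.read_csv("nonsurvey_brand_usage.csv")    # hypothetical: one 0/1 column per brand

# Ward linkage merges, at each step, the pair of clusters whose union gives the
# smallest increase in within-cluster sum of squares.
Z = linkage(usage.values, method="ward")

# Cut the dendrogram into three clusters, as suggested by the repeated runs.
usage["cluster"] = fcluster(Z, t=3, criterion="maxclust")
print(usage["cluster"].value_counts(normalize=True))  # share of consumers in each cluster
```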
4.3.2 Details of Cluster 1

Cluster 1 consists of 38% of the consumers (3,271). Figure 26 presents the means of particular brands within cluster 1 compared with the overall means of the same brands over the entire data set. Note that the complementary (non-usage) brand response has been modeled, so bars to the left of the overall mean actually indicate that the cluster has a higher incidence of response for that product. For instance, the chart shows the modeled response for Dove body wash at 55% of consumers in the overall data set but at only 32% of consumers in cluster 1, which means that cluster 1 contains significantly more Dove body wash buyers than the data set as a whole.

Figure 26: Input Means for Cluster 1

We conclude that consumers in cluster 1 display a significantly high propensity for purchasing body hygiene products, including bar soaps and body washes. Some of the main products purchased by these individuals, in order of significance, are Dove body wash, Dial bar soap, Caress bar soap, and Oil of Olay body wash. Another noticeable attribute is that they are infrequent buyers of food items such as Ragu pasta sauce, Wishbone salad dressing, and Gorton fillets.

4.3.3 Details of Cluster 2

Cluster 2 consists of only 8.7% of the consumers (748). It consists of individuals who are infrequent buyers of both the personal wash and the food items. They appear to be purchasers of items with very low response rates, such as Ponds Face, VICL Body, VSLN Body, and some of the detergents. It is possible that these responses were collected from a data set of beauty product and laundry product purchasers, or that these are simply outliers in the data set.

Figure 27: Input Means for Cluster 2

4.3.4 Details of Cluster 3

Cluster 3 consists of 53.3% of the consumers (4,588). The most important attribute of individuals in this cluster is that they are highly frequent buyers of food items compared with the average consumer. Important food items include Ragu pasta sauce, Wishbone salad dressing, Breyers ice cream, and Gorton fillets. These consumers are also characterized as infrequent buyers of personal hygiene products; for example, they are infrequent buyers of Caress bar soap, Dove body wash, Dial bar soap, and Lever 2000 bar soap.

Figure 28: Input Means for Cluster 3

Based on the consumer attributes revealed by the clustering analysis, we conclude that the clusters found may be approximating underlying data sets that have been aggregated. We found it intriguing that the cluster of individuals with a higher propensity for purchasing body hygiene products should have a significantly lower tendency to purchase food products, and that the cluster of consumers with a high propensity for purchasing food items should have a lower propensity for purchasing body hygiene products. Also, the few outlying individuals who are frequent purchasers of neither personal wash nor food items have been clustered separately. Possibly, information for body hygiene brands like Dove and for food items like Ragu pasta sauce was collected separately and aggregated into one data set, which would explain the clusters that we observe. As mentioned several times before, body soap and wash products overwhelm the usage data. Therefore the clustering analysis also segregates frequent and infrequent buyers of these items into separate clusters.
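Cluster descriptions of this kind can also be backed by simple splitting rules, in the spirit of the decision tree mentioned in Section 4.2. The sketch below reuses the same hypothetical brand-usage file and cluster labels as the earlier clustering sketch, fits a shallow decision tree to the cluster labels, and prints rules that approximate the cluster boundaries; it is illustrative only and not the SAS Enterprise Miner output.

```python
# Illustrative sketch: approximating cluster boundaries with simple variable-based
# rules. The input file is a hypothetical stand-in for the Unilever brand-usage extract.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier, export_text

usage = pd.read_csv("nonsurvey_brand_usage.csv")     # 0/1 indicator per brand
labels = fcluster(linkage(usage.values, method="ward"), t=3, criterion="maxclust")

tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # keep the rules simple
tree.fit(usage, labels)

# Human-readable rules, e.g. splits on dove_body_wash or ragu_pasta_sauce usage
print(export_text(tree, feature_names=list(usage.columns)))
```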
Given the likelihood that the data available to us has been aggregated from various sources, and that differences in the sources may be guiding the clustering analysis, extrapolation of the results to true usage behavior may be inappropriate. In other words, the clusters we have generated may accurately reflect groups in the data without being indicative of true consumer usage patterns.

4.5 Cluster–Wise Predictive Modeling

Regardless of the reason for the emergence of clusters in the data, they may potentially lead to improved prediction results. Here we investigate whether fitting a separate model for each cluster leads to better results. If accuracy is improved, it suggests that separate clusters of consumers are preferably modeled individually. Once we had identified the possibility of prior data aggregation leading to artificial structure, we explored a cluster-wise predictive methodology to obtain better results. The objective was to determine whether agglomeration of data sets is desirable or whether data for separate brands and product categories should be treated separately. For this purpose, each data cluster was treated as a separate data set, and logistic regression models were built for each. Predictive probabilities were thus generated for each data point according to the cluster to which it belonged. Probabilities for each of these data entries were also generated through an overall model fitted to the entire agglomerated data set. Lift curves were generated for both analyses and compared (a sketch of this cluster-wise procedure appears at the end of this section).

Figure 29 shows comparative lift charts for the cluster-wise and overall analyses for Dove bar soap, which are typical of the lift curves observed for other brands as well. This analysis shows that a cluster-wise predictive model is capable of giving significantly better results than a predictive model based on the entire agglomerated data set. This observation is in contrast to the weak results we obtained by separately modeling consumers in different GQM score strata.

[Lift curves comparing the cluster-wise and overall analyses, with curves for the cluster-wise analysis, the overall analysis, and the baseline.]
Figure 29: Comparative Lift Curves for Cluster-Based and Overall Models

The conclusion to be drawn from this analysis is that modeling small homogeneous groups in the data independently is a more useful exercise than fitting an overall model to an agglomerated data set. If our clusters reflect underlying data sets from different product categories that have been agglomerated to form the Unilever database, then it is more advantageous for Unilever to analyze its various data sets independently.
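A minimal sketch of the cluster-wise procedure is given below: it clusters consumers on brand usage, fits one logistic regression per cluster and one overall model, and compares how many responders each approach captures in its top decile. The file name, column names, and feature set are hypothetical assumptions, not the project's actual pipeline.

```python
# Illustrative sketch of cluster-wise versus overall predictive modeling.
# File and column names are hypothetical stand-ins for the Unilever extract.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

df = pd.read_csv("nonsurvey.csv")
brand_cols = [c for c in df.columns if c.startswith("brand_")]   # 0/1 usage indicators
target = "brand_dove_bar_soap"
features = [c for c in brand_cols if c != target]

# Cluster consumers on brand usage with Ward's method, as in Section 4.2
df["cluster"] = fcluster(linkage(df[brand_cols].values, method="ward"),
                         t=3, criterion="maxclust")

def top_decile_capture(y, p):
    """Share of all responders found among the 10% of consumers with the highest scores."""
    top = np.argsort(-p)[: len(p) // 10]
    return y.iloc[top].sum() / y.sum()

# Overall model: a single logistic regression on the agglomerated data set
p_overall = cross_val_predict(LogisticRegression(max_iter=1000), df[features], df[target],
                              cv=5, method="predict_proba")[:, 1]

# Cluster-wise models: a separate logistic regression within each cluster
p_cluster = pd.Series(0.0, index=df.index)
for _, idx in df.groupby("cluster").groups.items():
    p_cluster.loc[idx] = cross_val_predict(LogisticRegression(max_iter=1000),
                                           df.loc[idx, features], df.loc[idx, target],
                                           cv=5, method="predict_proba")[:, 1]

print("overall model:     ", top_decile_capture(df[target], p_overall))
print("cluster-wise model:", top_decile_capture(df[target], p_cluster.to_numpy()))
```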
5 PROJECT SUMMARY AND CONCLUSIONS

We conclude the report with a brief summary of what has been accomplished, followed by some overall conclusions and recommendations based on all of our analyses.

A single extract of the Unilever database was made available to us for the purpose of developing modeling methodologies useful for Unilever's business, identifying actionable insights from the data, and evaluating the data as an asset for targeted marketing. In addition to sizeable efforts to understand, clean, and prepare the data, we focused on generating predictive models of individual brand usage, because such models are arguably the most valuable tool for targeted marketing. Our efforts and interactions with Unilever representatives gave us a better understanding of the data, which in turn led to more refined analyses on subsets of the data. Finally, we generated clustering models to identify groupings in the data, whether due to natural brand usage patterns or induced artificially through the combination of various data sets.

We list a number of overall trends and conclusions that arise from our analyses:

• There is a need to understand the content of the Unilever database more fully. This includes gaining a greater understanding of the effect of missing data and assessing the quality of some of the third-party information. Also, the Unilever database includes some information on dates and usage quantities; the availability of such data could potentially lead to more interesting analyses.

• Logistic regression appears to be a suitable method for most of the prediction tasks undertaken. It has the benefits of being well studied for these types of data and of being interpretable, and its performance was comparable to more complicated methods for most of the tasks we tried. Overall, we observed that data quality issues seemed more important than the choice of modeling methodology.

• The Unilever data seems to be aggregated from several different data sources, many of which seem to be incompletely understood and/or poorly documented. We note that the quality and applicability of any data analysis depends critically on the quality of the underlying data and the extent to which it is understood.

• Our prediction efforts seem to indicate that, in general, predictions of self-reported brand usage based on MVC estimates and on usage of other products were somewhat effective. We believe that in this data set there is a potential problem with using MVC estimates to predict usage of certain individual brands, as these MVC estimates are sometimes based on the same data we are trying to predict. Demographic variables seemed to have less predictive power.

• Many of our prediction and clustering analyses seemed to be heavily influenced by data aggregation effects. In many cases, these effects may have dominated any underlying consumer usage effects. Because of this, our results may be representative of the data we are working with but extrapolate poorly to actual usage situations.

• A strategy of clustering consumers first and then fitting prediction models gave improved prediction results. This suggests that the data may best be analyzed in a cluster-wise manner. If the clusters are representative of underlying data sources, then these data sources may be better analyzed individually rather than in an aggregated fashion.

• While the Unilever database provides information on a large number of consumers, it is our opinion that modeling and insight generation about consumer behavior are best performed on a cleaner data set that is better understood and more uniformly gathered. Examples include the panel data to which Unilever already has access and data gathered by retailers of Unilever brands.