A research and education initiative at the MIT
Sloan School of Management
Unilever Data Analysis Project
Paper 179
Dimitris Bertsimas
Adam Mersereau
Geetanjali Mittal
June 2003
For more information,
please visit our website at http://ebusiness.mit.edu
or contact the Center directly at ebusiness@mit.edu
or 617-253-7054
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 Summary of Data and Unilever's Previous Data Mining Efforts
    1.1.1 Panel Data
    1.1.2 The Most Valuable Consumer and Existing Predictive Models
    1.1.3 Unilever Database
  1.2 Project Research Directions
  1.3 Data Provided
2 PREDICTION EFFORTS ON ORIGINAL DATA SET
  2.1 Data Cleaning
  2.2 Predictive Methods Investigated
  2.3 Results
  2.4 Conclusions
  2.5 Block Structure in the Unilever Data Extract
  2.6 Analysis of Fall 2000 Survey Data
3 NON-SURVEY DATA ANALYSIS
  3.1 Data Details
  3.2 Data Processing and Transformation
    3.2.1 Data Cleaning
    3.2.2 Data Aggregation
  3.3 Data Issues
    3.3.1 Conflicts in Computation of Household Layout Variables
    3.3.2 Unbalanced Brand Representation
    3.3.3 Insufficient Information on Source of Information in Non-Survey Data
  3.4 Predictive Modeling
    3.4.1 Choice of Models
    3.4.2 Predictive Efforts using Logistic Regression
  3.5 GQM Score-Based Stratified Analysis
    3.5.1 Stratified Prediction Models
    3.5.2 Predicting GQM Strata for Each Consumer
4 CLUSTERING ANALYSIS
  4.1 Clustering Background
  4.2 Cluster Analysis Details
  4.3 Inference from Cluster Analysis
    4.3.1 Effects of Unbalanced Brand Representation
    4.3.2 Details of Cluster 1
    4.3.3 Details of Cluster 2
    4.3.4 Details of Cluster 3
  4.5 Cluster-Wise Predictive Modeling
5 PROJECT SUMMARY AND CONCLUSIONS
TABLE OF FIGURES
Figure 1: Graphic Representation of MVC
Figure 2: Brands with Top Ten Consumer Responses
Figure 3: Consumer Response Distribution for Overall Data
Figure 4: Representative Diagram of Unilever Data Structure
Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
Figure 7: Coefficients of a Logistic Regression Model for Caress Body Wash
Figure 8: Coefficients of a Logistic Regression Model for Classico Pasta Sauce
Figure 9: Graphical Representation of Unilever Data Structure
Figure 10: Lift Curves for Logistic Regression Models Based on Fall 2000 Survey Data
Figure 11: Graphical Representation of Non-Survey Data
Figure 12: Consumer Response Distribution for Non-Survey Data
Figure 13: Category-Wise Brand Distribution
Figure 14: Household Member Age-Gender-Wise Aggregation
Figure 15: Consumer Response Distribution for Some of the Brands
Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network
Figure 17: Lift Curves for Different Unilever Brands
Figure 18: Significant Demographic Parameters
Figure 19: Graphical Representation of Logistic Regression Model for Gorton Fillets
Figure 20: Logistic Regression Model Coefficients for Gorton Fillets
Figure 21: Comparative Lift Curves for Models With and Without GQM Scores
Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models
Figure 23: Cluster Pie Chart
Figure 24: Cluster Statistics
Figure 25: Importance Value of Significant Variables
Figure 26: Input Means for Cluster 1
Figure 27: Input Means for Cluster 2
Figure 28: Input Means for Cluster 3
Figure 29: Comparative Lift Curves for Cluster-Based and Overall Models
1 INTRODUCTION
This report describes the data analysis project undertaken in collaboration with Unilever through the MIT Sloan School of Management Center for eBusiness and presents its conclusions. In this document we trace our interactions with Unilever, describe the data made available to us, describe various analyses and results, and present the overall conclusions reached in the course of the project.
Unilever has been a pioneer in mass-marketing, which focuses on widely broadcast advertising messages. This marketing approach is at odds with new trends towards targeted marketing and CRM (Consumer Relationship Management), and Unilever is interested in investigating how the new marketing philosophy applies to the packaged consumer goods industry in general and to Unilever in particular. Because Unilever sells its products not directly to consumers but through a variety of retail channels, it has only indirect contact with the end consumers. Thus the application of CRM (defined by Unilever as "Consumer Relationship Management" rather than the more standard "Customer Relationship Management" to make this distinction) is less obvious in its business. In particular, Unilever recognizes three clear obstacles to the application of CRM ideas in the packaged goods industry:
• Consumer transaction data is difficult to procure in the packaged goods industry.
• Data mining expertise and experience are new to packaged goods companies.
• Marketing efforts in the packaged goods industry focus on brands rather than on consumers.
Unilever employs data mining in the area of CRM. This effort is partly undertaken in the Relationship Marketing Innovation Center (RMIC), a group that transcends Unilever's individual brands. Unilever's project with MIT has been part of these efforts.
Especially in light of the second bullet point above, the MIT team was asked to help evaluate the potential of data mining technology for consumer targeting in Unilever's business. Unilever was to make available to the MIT team representative samples of the data at Unilever's disposal, and the MIT team was to analyze this data and research new data mining methods for making use of it in a targeted marketing framework.
1.1 Summary of Data and Unilever's Previous Data Mining Efforts
This section summarizes our understanding of the data Unilever has available for analysis, as well as
Unilever’s previous data mining efforts with this data. A primary data source is a Unilever database of
consumers. The database contains information on individuals and households who have interacted with
Unilever in some fashion in the past. The data includes demographic and geographic information as well
as self-reported usage and survey information.
1.1.1 Panel Data
Based on information available for a subset of Unilever consumers (the panel data), two models were fit, the so-called Demographic and Golden Question models, for predicting whether a consumer is an MVC ("Most Valuable Consumer") as measured by dollar spend on Unilever's collection of brands. The concept of MVC and the two models are described in more detail in the next section.
1.1.2 The Most Valuable Consumer and Existing Predictive Models
Much of Unilever's data mining efforts prior to August 2001 were focused on identifying the "most valuable consumers" (MVCs) on a brand- and company-wide level. The MVC for a specific brand is a consumer determined to spend highly on the Unilever brand and on the industry category. Specifically, consumers are ranked both by their dollar spend on a Unilever brand and by their dollar spend in the corresponding industry category, and individuals are categorized as "heavy", "medium", or "low" brand or category consumers. The MVCs for a given brand are generally defined as those consumers found in the shaded regions of the following table.
(Figure: a 3x3 grid of heavy/medium/low (H/M/L) profitability to the Unilever brand versus H/M/L spend in the industry category; the shaded cells mark the MVCs.)
Figure 1: Graphic Representation of MVC
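To make the ranking concrete, here is a minimal pandas sketch that labels consumers by spend terciles and flags MVCs. The column names brand_spend and category_spend, and the particular choice of shaded cells (heavy in one dimension, at least medium in the other), are illustrative assumptions; the report only states that MVCs fall in the shaded region of Figure 1.

```python
import pandas as pd

def spend_tier(spend: pd.Series) -> pd.Series:
    """Label each consumer L / M / H by tercile of dollar spend."""
    return pd.qcut(spend, q=3, labels=["L", "M", "H"])

def flag_mvc(df: pd.DataFrame) -> pd.Series:
    """Flag MVCs for one brand from 'brand_spend' and 'category_spend' columns.

    Assumes, for illustration only, that the shaded region of Figure 1 means
    heavy (H) in at least one dimension and at least medium (M) in the other.
    """
    brand = spend_tier(df["brand_spend"])
    category = spend_tier(df["category_spend"])
    return ((brand == "H") & (category != "L")) | ((category == "H") & (brand != "L"))
```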
The concept of MVC can also be extended to the level of the overall company. Unilever's Demographic and Golden Question models are logistic regression models that estimate the probability that an individual consumer is an MVC in terms of their expenditure on Unilever as a whole. The Demographic model uses demographic variables exclusively, while the Golden Question model uses both demographic inputs and a minimal set of survey responses about product usage. These models are used to score and rank individuals in the Unilever database.
1.1.3 Unilever Database
The Unilever Database is a large data warehouse owned by Unilever but maintained by Axciom, a data
warehousing company. The database includes varying amounts of information on Unilever consumers,
compiled from a number of sources. For each consumer, the database potentially reports on:
• Demographic data at the individual, household, and geographic block levels
• Responses to promotional events
• Survey responses
• Predictions of the Demographic and Golden Question models
The database contains no transactional purchase information, although it does provide self-reported brand
usage data, model predictions, and contact history information. The accuracy of the self-reported brand
usage data varies by brand.
1.2 Project Research Directions
At a project kickoff meeting in Greenwich, CT, in August 2001, we discussed several research directions of interest to both MIT and Unilever:
• Investigate methods for predicting, characterizing, and clustering individual consumer usage of products.
• Experiment with alternate definitions of MVC.
• Develop alternative models to predict MVC.
• Develop dynamic logistic regression models, that is, prediction methodologies based on logistic regression that evolve in time.
• Develop optimization-based logistic regression subset selection methodologies. Specifically, how can we use such a methodology to design a questionnaire of maximum value and minimum length?
Although this list includes a number of items that may be of interest to Unilever in the future, the majority
of our efforts were focused on the first of these topics. This decision was largely guided by the data made
available to us, which includes no information on consumer profitability and contains limited time stamp
information. In subsequent sections of the report, we will revisit the data limitations and the role of the data
in guiding our analysis.
1.3 Data Provided
We received several data files from Unilever on a CD dated November 30, 2001. The files and our
understanding of them are as follows:
• "UNIFORM.TXT": A large data file sampled from the Unilever Database. The extraction was performed via uniform random sampling from the most complete and reliable records in the database.
• "STRATIFIED.TXT": A large data file sampled from the Unilever Database. This file is similar to UNIFORM.TXT except in the method of sampling: stratified sampling was used, with stratification done with respect to the Golden Question model scores.
• "MIT LAYOUT #1 AXCIOM.XLS": A list of variable names and the layout for the data in UNIFORM.TXT and STRATIFIED.TXT.
• "LAYOUT DESCRIPTION #1.XLS": A list of variable names along with a brief description of many of the variables.
• "MIT LAYOUT #2 INFOBASE.XLS": An enumeration of InfoBase data entries, which form some of the variables in UNIFORM.TXT and STRATIFIED.TXT. Many of the variables described in MIT LAYOUT #2 INFOBASE.XLS do not appear in the data files.
• "DATA DICTIONARY JUNE 2001.DOC": An enumeration of InfoBase variables, only some of which are included in the data files.
• "BRAND ID&NAME.XLS": A table matching brand IDs to brand names.
• "MARKET.TXT": A table matching market codes to county names.
• "VIC'S REPLY.TXT": Some detailed clarifications regarding LAYOUT DESCRIPTION #1.XLS.
We were subsequently given the following file:
• "MASTER.XLS": A table matching brand IDs to brand names, franchises, and product categories.
We were provided no other description or information regarding the data.
The UNIFORM.TXT data file has a raw size of 114 Mb, and includes 367 variables for 46,307
observations. The STRATIFIED.TXT data file has a raw size of 133 Mb, and includes the same 367
variables for 53,693 observations. Many of the observations in both data files include significant missing
data. To summarize these data sets, we provide a brief discussion of the sets of variables included in
UNIFORM.TXT and STRATIFIED.TXT:
• Individual Layout: This set of variables includes individual and household ID codes, individuals' names, and basic demographic variables for age and gender. Other demographic variables such as "Marital Status," "Employment Status," "Occupation Type," and "Ethnic Code" have a large number of missing values.
• Household Layout: This set of variables includes data specific to a household (note that a household may include multiple individuals), and is the result of merging the Unilever Database with data from third-party sources. The third-party data offers information on household vehicle ownership, the distribution of genders and ages in the household, occupations, home ownership, credit card ownership, and membership in a number of lifestyle clusters (e.g., "traditionalist," "home and garden"). Most of these fields have at most 20% missing data.
• Demographic Model: This set of variables gives the results of the Demographic logistic regression model for prediction of MVC. The most important field here is the "Model Score," which gives the model's prediction as a value between 0 and 1. The field "Model Score Group" denotes the decile in which the model score falls; deciles are defined according to the Demographic model results.
• Golden Question Model: This set of variables gives the results of the Golden Question logistic regression model for prediction of MVC. The most important field here is the "Model Score," which gives the model's prediction as a value between 0 and 1. The field "Model Score Group" groups the observations into deciles according to the Golden Question model results. The Demographic and Golden Question model scores are positively correlated; we measured a correlation of 0.6 on the UNIFORM.TXT data.
• Block Group: This set of variables describes the block the household belongs to. A block is an address-based segment of the population; thus, there are likely multiple households per block. These variables essentially provide demographic information about the geographical neighborhood of the household, describing the urban/rural breakdown, the ethnic breakdown, the distribution of home valuation, the employment breakdown, education level, etc.
• Brand Usage Layout #01-#20: For each individual, there are 20 sets of brand usage variables; thus each individual is associated with at most 20 brands. The brands are a mixture of Unilever and non-Unilever brands. A total of 259 brands appear in the UNIFORM.TXT data set, with 96 chosen by at least 100 individuals, 55 chosen by at least 1,000 individuals, and 17 chosen by at least 10,000 individuals. The average individual reports interaction with 9.5 distinct brands.
In our analysis we concentrated on the file UNIFORM.TXT instead of the file STRATIFIED.TXT, because it seemed appropriate to use a representative sample of the underlying data set.
2 PREDICTION EFFORTS ON ORIGINAL DATA SET
Our initial efforts were towards developing methods for predicting usage of individual brands using
demographic variables and cross-purchase information as inputs. We were particularly interested in
methodological innovations for using these different sets of variables to make useful predictions.
At this stage of the project we chose to concentrate on the prediction of usage for individual brands due to
the following reasons:
• Prediction of brand usage is of obvious use in targeted marketing.
• A lack of information on consumer profitability and limited time stamp information eliminated several of the topics mentioned in Section 1.2.
• The problem is of general interest to Unilever and other packaged goods companies that have large amounts of data over a wide range of products.
2.1 Data Cleaning
In an effort to clean the data set for analysis, we first performed a brand aggregation using the Franchise and Category information provided in the MASTER.XLS file. The purpose was to eliminate the distinction between very similar products. For example, we judged that different flavors and versions of the same product should appear the same from the perspective of a company-level analysis. The new brand labels can in most cases be mapped one-to-one to the original notion of brands. After eliminating those brands with reported usage by fewer than 100 individuals, we were left with 76 unique brands for analysis in the UNIFORM.TXT data set. Figure 2 shows the brands with the top ten consumer responses and Figure 3 shows the distribution of reported usage among the 76 brands. Note that the patterns of reported usage in this data are not representative of the actual sales of these products.
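As a rough illustration of this cleaning step, the pandas sketch below maps raw brand IDs to an aggregated franchise/category label and drops rarely reported brands. The column names ("Brand ID", "Franchise", "Category", brand_id, individual_id) are assumptions about the layout of MASTER.XLS and the extract, not a documented schema.

```python
import pandas as pd

def aggregate_and_filter_brands(usage: pd.DataFrame, master: pd.DataFrame,
                                min_users: int = 100) -> pd.DataFrame:
    """usage: long-format table with one row per reported (individual_id, brand_id) pair;
    master: the MASTER.XLS table matching brand IDs to franchises and categories."""
    # Collapse flavors/versions of a product into one label per franchise/category.
    master = master.assign(agg_brand=master["Franchise"].str.strip() + " "
                                     + master["Category"].str.strip())
    brand_map = dict(zip(master["Brand ID"], master["agg_brand"]))
    usage = usage.assign(agg_brand=usage["brand_id"].map(brand_map))

    # Keep only aggregated brands reported by at least `min_users` individuals
    # (100 in the report, leaving 76 brands).
    counts = usage.groupby("agg_brand")["individual_id"].nunique()
    keep = counts[counts >= min_users].index
    return usage[usage["agg_brand"].isin(keep)]
```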
# Consumer Responses    Product
9367                    Suave Shampoo / Conditioner
6643                    Dove Bar Soap
6146                    Ragu Pasta Sauce
4989                    Lipton Tea Bags
4938                    Dial Bar Soap
4923                    Bath & Body Works Bar Soap
4640                    Lever 2000 Bar Soap
4554                    Good Humor / Breyers Ice Cream
4399                    Other Body Wash
4302                    Dove Body Wash
Figure 2: Brands with Top Ten Consumer Responses
(Figure: bar chart of the number of customers using each of the 76 products; y-axis labeled "# Customers Using Product (13,870 total)".)
Figure 3: Consumer Response Distribution for Overall Data
For each of these 76 products, we developed a binary indicator of reported usage for each consumer, with 1 representing a positive usage response and 0 representing no indication of usage.
In addition to the 76 brands, we focused on the following four demographic variables in our prediction efforts:
• Presence of Child (Yes/No)
• Household Size
• Income Category
• Geographic Region (North, South, Midwest, West)
These variables were chosen because they included relatively few missing values and because they generally represent important individual and household indicators which act as proxies for underlying factors that influence purchase behavior.
The resulting cleaned data thus contained four demographic variables and 76 binary reported usage variables for each of the 46,307 consumers. A representation of this data is as follows:
(Figure: a table with one row per consumer (1 through 46,307); the columns are the demographic variables Child, Household Size, Income Category, and Region, followed by the 76 brand usage variables.)
Figure 4: Representative Diagram of Unilever Data Structure
As is common practice in data mining studies, to deal with the so-called "over-fitting" problem, the UNIFORM.TXT data was randomly partitioned into a training set of 27,731 consumers for fitting of model parameters, a validation set of 9,291 consumers for choosing among models, and a test set of 9,285 consumers for measuring final results.
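A minimal sketch of such a three-way split, assuming the cleaned data sits in a pandas DataFrame with one row per consumer; the 60/20/20 proportions approximate the counts quoted above.

```python
import pandas as pd

def train_val_test_split(data: pd.DataFrame, seed: int = 0):
    """Randomly partition consumers into training / validation / test sets,
    roughly 60% / 20% / 20% (27,731 / 9,291 / 9,285 in the report)."""
    shuffled = data.sample(frac=1.0, random_state=seed)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled.iloc[:n_train],
            shuffled.iloc[n_train:n_train + n_val],
            shuffled.iloc[n_train + n_val:])
```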
2.2 Predictive Methods Investigated
Logistic regression has a long history of use in targeted marketing for linking demographic variables and purchase behavior, while collaborative filtering is an approach that has developed with the rise of e-commerce for predicting a consumer's preferences based on the preferences of similar consumers. Our interests in this study were in investigating the relative merits of these two methodologies and of combining their results in various ways.
Logistic regression establishes a relationship between predictor variables and a response variable via the logistic function. In particular, we model a consumer's probability of response p in terms of a set of predictor variables x1, …, xn as follows:

p = exp( Σ_i β_i x_i ) / ( 1 + exp( Σ_i β_i x_i ) )
Logistic regression has often been used in marketing contexts, and has the advantages that the model is
interpretable and can be fit using efficient methods. It can be susceptible to overfitting, however, when
there are many predictor variables.
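The logistic function above translates directly into code. The sketch below computes the response probability for given coefficients; the coefficient values shown are placeholders, not fitted values from this study, and in practice the beta vector would be estimated with a standard statistics or machine learning package.

```python
import numpy as np

def logistic_probability(x: np.ndarray, beta: np.ndarray) -> float:
    """p = exp(beta . x) / (1 + exp(beta . x))."""
    z = float(np.dot(beta, x))
    return float(np.exp(z) / (1.0 + np.exp(z)))

# Placeholder coefficients for (intercept, presence of child, household size,
# income category, region indicator); the leading 1.0 in x is the intercept term.
beta = np.array([-2.5, 0.4, 0.1, 0.2, -0.3])
x = np.array([1.0, 1.0, 3.0, 4.0, 0.0])
print(logistic_probability(x, beta))
```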
Collaborative filtering models an individual consumer’s response as a weighted average of the responses of
other consumers in the database, where the weighting is typically according to a similarity measure among
consumers. The approach is appropriate in applications where there are sufficiently large numbers of
products to allow computation of a useful similarity measure among consumers. The collaborative filtering
approach has proved useful particularly in internet recommender systems.
Examples include the
recommendation engines employed by Amazon and Netflix. Collaborative filtering is not as easily
interpretable as logistic regression, but has the advantages that it is conceptually simple and is adept at
handling many variables representing choices among a large number of possibilities.
In our implementation, we compute similarities among consumers based on reported usage information only. Since this is binary data, we require a suitable similarity measure. We have made use of the so-called Jaccard similarity. Given the usage vectors of two individual consumers, define a to be the number of products the two consumers have in common, b to be the number of products unique to consumer 1, and c to be the number of products unique to consumer 2. Then the Jaccard similarity is given by the ratio a/(a+b+c). The Jaccard measure thus takes into account products the two consumers have in common, but ignores items that neither has chosen. As our reported usage data is relatively sparse, there are likely to be many products chosen by neither. With the Jaccard similarity measure, these products will not inflate the similarity measure as they would with, say, a correlation measure. We also made use of a weighting scheme that weighs rare products more heavily than common ones. Such a modification is based on the observation that selection of a rare product is more informative than selection of a common product. Such an "inverse frequency" weighting is common in collaborative filtering systems.
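The sketch below is one way to realize this scheme: a weighted Jaccard similarity in which each product contributes an inverse-frequency weight, and a prediction formed as a similarity-weighted average of other consumers' reported usage. It mirrors the description above under the assumption that usage is a 0/1 NumPy matrix (consumers by brands); it is not the authors' exact implementation.

```python
import numpy as np

def weighted_jaccard(u: np.ndarray, v: np.ndarray, w: np.ndarray) -> float:
    """Jaccard similarity a/(a+b+c) between binary usage vectors u and v,
    where each product contributes its weight w instead of a count of 1."""
    a = w[(u == 1) & (v == 1)].sum()          # products in common
    b = w[(u == 1) & (v == 0)].sum()          # unique to consumer u
    c = w[(u == 0) & (v == 1)].sum()          # unique to consumer v
    denom = a + b + c
    return float(a / denom) if denom > 0 else 0.0

def predict_usage(usage: np.ndarray, consumer: int, brand: int) -> float:
    """Predict the consumer's usage of `brand` as a similarity-weighted average
    of the other consumers' reported usage of that brand."""
    n_consumers, n_brands = usage.shape
    others = np.ones(n_brands, dtype=bool)
    others[brand] = False                     # hide the target brand from the similarity
    freq = usage[:, others].mean(axis=0)
    inv_freq = np.where(freq > 0, 1.0 / freq, 0.0)   # rare products weigh more heavily
    sims = np.array([
        0.0 if i == consumer else
        weighted_jaccard(usage[consumer, others], usage[i, others], inv_freq)
        for i in range(n_consumers)
    ])
    total = sims.sum()
    return float(sims @ usage[:, brand]) / total if total > 0 else 0.0
```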
Initial tests of these methods indicated that we might be able to produce more accurate predictions by using
the predictions of a logistic regression model and a collaborative filtering model as inputs to a third model.
We considered three methods for combining these models: a weighted average of the two results, another
logistic regression taking the two results as inputs, and an optimization model that computes a linear
discriminant in the space of the logistic and collaborative model outputs. Upon further testing, we decided
that the logistic regression method for combining models exhibits the best performance.
In the end, we examined several models for predicting individual product usage from the demographic and other product usage data:
• RAND: Predictions are random numbers between 0 and 1. This is less a prediction method than a baseline for comparison.
• LOGIT: A logistic regression model using demographic variables as predictors.
• COLLAB: The collaborative filtering model using reported usage data from other consumers.
• COMB_LOGIT: A logistic regression model using the LOGIT and COLLAB model predictions as predictor variables (see the sketch after this list).
• FULL_LOGIT: A logistic regression model using the demographic variables as well as the usage variables for products other than the one being modeled. We implemented two versions of the FULL_LOGIT model: one using all variables available, and one which uses subset selection methods to choose an accurate model with only 5 variables.
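A minimal sketch of the COMB_LOGIT idea using scikit-learn: the predictions of the demographic logistic regression and the collaborative filter become the two inputs of a second logistic regression. The variable names and the use of scikit-learn are assumptions; the report does not specify the software employed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_comb_logit(logit_scores: np.ndarray, collab_scores: np.ndarray,
                   y: np.ndarray) -> LogisticRegression:
    """Fit the combining model: a logistic regression whose only predictors are
    the LOGIT and COLLAB predictions for each consumer."""
    X = np.column_stack([logit_scores, collab_scores])
    return LogisticRegression().fit(X, y)

# Usage (scores ideally computed on a held-out set so the combiner is not over-fit):
# comb = fit_comb_logit(logit_val, collab_val, y_val)
# p = comb.predict_proba(np.column_stack([logit_test, collab_test]))[:, 1]
```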
2.3 Results
We present our results for the various methods in the form of lift curves. To compute a lift curve for a
given product, we use a method described above to assign each consumer in the validation set an estimated
probability of usage. We then rank the consumers in order of their predicted usage, and ask the question “If
we contacted the top M consumers in this list, how many actual users would we find?” If we repeat this
question for every choice of M, we obtain the lift curve. Random predictions should roughly give a straight
diagonal line, while effective prediction algorithms will result in lift curves as high as possible above this
diagonal line. Higher lifts indicate better performance of a predictive model.
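This construction translates directly into a short routine; `scores` are the predicted usage probabilities and `actual` the 0/1 reported usage for the validation consumers (variable names assumed).

```python
import numpy as np

def lift_curve(scores: np.ndarray, actual: np.ndarray) -> np.ndarray:
    """lift[m] = number of actual users found among the top m+1 ranked consumers."""
    order = np.argsort(-scores)        # rank consumers by predicted usage, best first
    return np.cumsum(actual[order])

# The random-prediction reference is the diagonal: contacting M consumers
# finds roughly M * actual.mean() users on average.
```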
Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
Figure 5 includes lift curves for the models used to predict usage of the products Caress Body Wash and
Classico Pasta Sauce, noting that these are indicative examples of results obtained for several products.
The first set of lift curves illustrates the results for the COLLAB, LOGIT, and COMB_LOGIT methods.
We observe that while the LOGIT model using the four demographic variables does better than random prediction, the COLLAB model using usage information of other products does significantly better. Combining the two models does only slightly better than the collaborative filtering approach.
The following set of lift curves adds the FULL_LOGIT method:
Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
The FULL_LOGIT model based on all the variables performs even better than the COMB_LOGIT model, and even the version using only a few selected variables exhibits surprisingly good performance. This observation motivated a closer look at the specific coefficients of the variables in these parsimonious models. The tables below give the coefficients for the two 5-variable models responsible for the lift curves above:
Target: Caress Body Wash (FULL_LOGIT coefficients)
  Intercept                    -2.47
  Caress Bar Soap               1.61
  Lever Body Wash               0.59
  Oil Olay Body Wash            0.51
  Herbal Essence Body Wash      0.46
  Dove Body Wash                0.45
Figure 7: Coefficients of a Logistic Regression Model for Caress Body Wash
Target: Classico Pasta Sauce (FULL_LOGIT coefficients)
  Intercept                      -2.89
  Five Bros Pasta Sauce           1.92
  Francisco Rinaldi Pasta Sauce   1.31
  Prego Pasta Sauce               0.52
  Breyer's Ice Cream             -2.11
  Lipton Tea Bags                -2.53
Figure 8: Coefficients of a Logistic Regression Model for Classico Pasta Sauce
Thus, among the most useful variables for predicting Classico Pasta Sauce usage are other pasta sauce
brands, while other brands of body wash are among the most useful information for predicting usage of
Caress Body Wash in the overall data set.
2.4 Conclusions
Our analysis of the overall UNIFORM.TXT data set led us to some intriguing conclusions, and motivated a
closer look at the data set and its sources.
Methodologically, while combination models are interesting, logistic regression, perhaps with subset
selection, is a sufficiently powerful method for analyzing this data.
Furthermore, it has the added
advantage of interpretability and is well understood as a tool in marketing.
Our primary finding was that in this data set, reported brand usage variables are considerably more
powerful than the limited set of demographics we looked at. In particular, usage of a given product can be
predicted surprisingly accurately using usage data from a small number of closely related products.
These pronounced and powerful trends encouraged us to take a closer look at the underlying data. After
discussions with our counterparts at Unilever, it was revealed that much of the data we were working with
was aggregated from two consumer surveys. Indeed, one of the surveys asked questions regarding personal
washes, while the other did not. Also, one of the surveys included questions regarding pasta sauces, while
the other did not. Clearly, the differences between the two surveys largely explained the high correlation
we were observing among usages of similar products.
Thus, our most significant conclusion from this analysis was that our models were achieving impressive
results, but were likely modeling the data collection technique rather than the underlying phenomenon.
Unfortunately, such a model may generalize poorly to panel data or to real world situations. Such survey
data, in an aggregated format, may serve as a poor proxy for purchase behavior.
The results of this analysis motivated more work to understand the source of the data. Future efforts were focused on identifying portions of the data that were as uniform as possible, and on forming prediction and clustering models using relatively simple modeling techniques such as logistic regression.
2.5 Block Structure in the Unilever Data Extract
The previous analysis motivated a closer look at the data. At this point Unilever provided us copies of two
questionnaires, the responses to which comprised a large portion of the data in the Unilever database
extract. Using timestamps that indicated the dates of collection for the various responses, we were able to
obtain a more detailed understanding of the data. This structure is indicated in the following figure:
(Figure: one row per consumer (Consumer 1 through Consumer 46,307); the columns hold demographic variables, the DM and GQM model scores, and the usage variables, with the usage variables falling into three blocks: Spring 2000 survey responses, Fall 2000 survey responses, and Non-Survey / "coupon" responses, labeled 39 brands, 11 Unilever brands, and 70 brands.)
Figure 9: Graphical Representation of Unilever Data Structure
In figure 9, rows indicate data available for a single consumer, while columns indicate different variables in
the data extract. Some demographics and model scores are reported for each consumer. In the section
marked as “Usage Variables,” the shaded blocks indicate the presence of self-reported usage data. After
some investigation, we believe that roughly half the consumers had Spring 2000 survey responses and no
Fall 2000 survey responses, while the other half had Fall 2000 responses and no Spring 2000 responses. A
subset of consumers from both groups also had some additional brand usage responses, which we hereafter
refer to as “Non-Survey” data. We were advised that this “Non-Survey” data largely represented responses
to coupon redemption.
In subsequent analyses, we concentrated on sets of brands and consumers whose usage data fell uniformly in one of these blocks. The individual surveys reported on a relatively small set of brands, while the Non-Survey ("coupon") data included a much larger set of brands. For this reason, we focused our efforts on the Non-Survey data. In what follows, we will briefly discuss our limited modeling efforts on the Fall 2000 survey data, and we will provide a lengthy discussion of an extensive analysis of the Non-Survey data.
2.6 Analysis of Fall 2000 Survey Data
While our efforts were focused on the Non-Survey data, we also performed an exploratory analysis of the
Fall 2000 survey data and associated demographic variables. We extracted a small sample of Fall 2000
data that included 2500 consumers.
The Fall 2000 survey data includes response data on a limited number of brands. The Unilever brands represented are Suave, Lipton, Breyer's, Wishbone, Dove, Lever2000, Caress, and Snuggle. This limited amount of response information constrained the scope of analysis we could perform, and hence we tried the following three indicative tasks:
• Predicting reported Caress usage given all other available variables.
• Predicting reported Caress usage given demographic variables only.
• Predicting simultaneous Caress and Snuggle usage given all other available variables.
The third task was an attempt to use predictive modeling to identify cross-selling opportunities.
We used only the demographic variables described above: region of residence, income level, household size, and presence of children. We tried several modeling methodologies including nearest neighbor methods, logistic regression, discriminant analysis, classification trees, and neural networks. These methodologies are described in more detail in a subsequent section. We concluded that the choice of modeling algorithm made an insignificant difference in the quality of the results obtained. The best-performing models predicted 100% non-usage, which gave a 32% misclassification rate for Caress, and a 19% misclassification rate for Caress / Snuggle cross-sells. These results indicated that it is difficult to make predictions given the limited number of variables.
(Figure: lift curves of # responses versus # targeted for Caress (demographic and response predictors), Caress (demographic predictors only), and Caress & Snuggle, each plotted against its random-targeting reference line.)
Figure 10: Lift Curves for Logistic Regression Models Based on Fall 2000 Survey Data
On constraining the models to generate a reasonable number of usage predictions, the best models achieved
35% misclassification rate for Caress and 21% misclassification rate for Caress / Snuggle cross-sells.
Figure 10 shows lift curves from the logistic regression models. We observe that the model making use of
demographic variables only gave insignificant lift.
The models based on both demographic and brand usage information seemed to achieve more lift. The most useful predictor variables seemed to be other brands of soap, namely Dove and Lever2000. While this may reflect real consumer usage patterns, it may also be due to the design of the survey, which included separate sections for soap and for other products. As with the other analyses in this report, the question remains as to whether these results are transferable to real-world usage patterns.
3 NON-SURVEY DATA ANALYSIS
Here we report on a subset of the original UNIFORM.TXT data which we refer to as the "Non-Survey" data. Our decision to analyze this section of the data was guided by the internally homogeneous composition of this data set and the fact that it included information on a wide range of products. The analysis is organized as follows:
• Data details: A detailed description of the Non-Survey data extraction and composition, highlighting some of its inherent features.
• Data cleaning and aggregation: A detailed description of our treatment of missing values and outliers and of the transformation of some of the variables.
• Data credibility issues: A few observations suggesting the possibility of artificial data structure and data bias.
• Predictive modeling: An in-depth analysis to predict brand usage and explore the Most Valuable Consumer concept using various modeling techniques based on both demographics and the Golden Question Model and Demographic Model scores.
• Cluster analysis: A description of efforts to segregate consumers into distinctive clusters and to use this information to enhance predictive efforts.
3.1 Data Details
The layout of the data provided by Unilever was explained above. The Non-Survey data was extracted from the file UNIFORM.TXT, which contains information on 46,307 consumers. Each usage entry in UNIFORM.TXT bears a date stamp indicating the time of data collection. A majority of data entries in this file bear one of two time stamps, namely 15 May 2000 and 15 November 2000. Based on the quantity of usage data associated with these two dates and the fact that the data with these time stamps seems to correspond to the surveys provided to us, we assumed that usage entries with these time stamps correspond to the Spring and Fall survey data. The remaining data is what we analyze in this section, and we refer to it as the Non-Survey data. As per information provided by Unilever, we believe the Non-Survey data represents product promotion coupon responses.
The Non–Survey data consists of 14,492 consumers. For each consumer we have demographic information
and reported usage for seventy brands, some of which are Non–Unilever brands. In addition, Golden
Question Model and Demographic Model scores have been provided for each consumer. The diagram
below is a graphical representation of the data layout and structure.
(Figure: the same layout as Figure 9, with one row per consumer (1 through 46,307) and columns for demographic variables, the GQM and DM scores, and the usage variables; the Non-Survey / "coupon" block of 70 brands is shown alongside the Spring 2000 and Fall 2000 survey blocks, labeled 39 brands and 11 Unilever brands.)
Figure 11: Graphical Representation of Non-Survey Data
Each consumer reports usage of at most twenty brands. Following the data cleaning efforts, the maximum number of brands reported by a consumer was reduced to seventeen. The following diagram shows a distribution of consumers according to the number of responses reported by each consumer. It is observed that a large majority of the consumers report usage of very few brands, which leads to the sparse nature of the Non-Survey data set.
(Figure: histogram of # consumers versus # responses, from 1 to 17 reported brands per consumer.)
Figure 12: Consumer Response Distribution for Non-Survey Data
Figure 13 depicts category–wise brand distribution of Unilever and non–Unilever brands in the Non–
Survey data set. There is a dominating presence of body wash and bar soap brands in the data set, followed
by the presence of food items. This is because of the survey design and, therefore, the distribution is not
representative of true usage patterns, a potential bias that we will investigate later in the discussion.
(Figure: number of brands per category (Bar Soap, Body Wash, Shampoo, Detergents, Food Items, Body Items, Misc) in the Non-Survey data.)
Figure 13: Category-Wise Brand Distribution
3.2 Data Processing and Transformation
In contrast to our initial efforts, during the analysis on Non–Survey data we sought to incorporate a wide
range of demographic and model score variables for a more comprehensive study. This necessitated
numerous decisions regarding data set preparation like choice of demographics and treatment of missing
values and outliers.
3.2.1 Data Cleaning
Following the data cleaning efforts described in Section 2.1, the Non-Survey data set was further cleaned and filtered. The data set began with 14,492 consumers, which were reduced to 8,608 consumers after cleaning and filtering. As stated earlier, the Non-Survey data had over 100 demographic variables and 70 brand variables. All of the brand variables were considered in the analysis regardless of affiliation with Unilever. Block Layout variables were not considered in this analysis due to their complex nature and to expedite the process; we believed the Household and Individual Layout variables contained enough relevant information. Among the Individual Layout and Household Layout demographics, variables with more than 20% missing values were eliminated, since imputing missing values for such a large number of entries would have led to misleading results. Some demographic variables were rejected due to the ambiguous nature of their source and method of computation, and certain variables that seemed to be derived from other variables in unclear ways were also rejected; examples include the lifestyle clusters such as "traditionalist," "home and garden," etc. For important demographic variables, all consumers with missing values were eliminated. For the remaining demographics measured on a continuous scale, missing values were imputed with mean values. Outlying consumers whose demographic values were more than five standard deviations from the mean were also eliminated. This left a negligible number of consumers with missing values among demographics measured on a binary scale; these consumers were also eliminated.
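A rough pandas sketch of these cleaning rules, assuming the Non-Survey extract is in a DataFrame with the demographic columns listed in continuous_cols and binary_cols; the thresholds follow the text (20% missing, mean imputation, five standard deviations).

```python
import pandas as pd

def clean_demographics(df: pd.DataFrame, continuous_cols: list, binary_cols: list) -> pd.DataFrame:
    out = df.copy()

    # Drop demographic variables with more than 20% missing values.
    keep = [c for c in continuous_cols + binary_cols if out[c].isna().mean() <= 0.20]
    continuous = [c for c in continuous_cols if c in keep]
    binary = [c for c in binary_cols if c in keep]

    # Impute remaining missing continuous values with the column mean.
    out[continuous] = out[continuous].fillna(out[continuous].mean())

    # Remove consumers more than five standard deviations from the mean
    # on any continuous demographic.
    z = (out[continuous] - out[continuous].mean()) / out[continuous].std()
    out = out[(z.abs() <= 5).all(axis=1)]

    # Drop the (few) consumers still missing a binary demographic.
    return out.dropna(subset=binary)
```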
3.2.2 Data Aggregation
Certain demographic variables were aggregated for the purpose of obtaining simpler, more interpretable models, to decrease the computational effort, and to deal with potential model over-fitting.
State-Wise Regional Aggregation
The variable FIPS_census_state contains information on the state of residence of the consumer. These were aggregated into the following nine regions:
• New England
• Middle Atlantic
• East North Central
• West North Central
• South Atlantic
• East South Central
• West South Central
• Mountain
• Pacific
This aggregation was needed to decrease computational and time complexity. In addition, we believed that a region-wise approach would be more insightful.
Household Member Age-Gender-Wise Aggregation
Variables containing information on the age- and gender-wise presence of household members were aggregated. These are variables of the type IB_males_0_2, IB_females_3_5, etc.
Presence of Male 0-2, 3-5, 6-10, 11-15, 16-17 years  ->  Presence of Male Child (Non-Earning)
Presence of Male 18-24, 25-34, 35-44, 45-54 years    ->  Presence of Male Adult (Earning)
Presence of Male 55-64, 65-74, 75 plus years         ->  Presence of Male Senior
Figure 14: Household Member Age-Gender-Wise Aggregation
The variables were grouped into presence of a child, adult, or senior member of male, female, or unknown gender. The age thresholds for segregation were chosen based on certain assumptions about the occupational status of a household member of a given age, as illustrated in Figure 14. The variables were combined into child, adult, and senior categories as indicated above, and the same treatment was extended to the variables for female and unknown gender. This led to a compression of 36 variables into 9 variables, while preserving the age and gender composition of the household and its influence on consumer response. Prior to our meeting with Unilever representatives in July 2002, we assumed that household members in the age group of 16 to 17 could be included in the adult category. However, we were informed that Unilever considers consumers up to the age of 17 to be children. We made the necessary modifications to the data set thereafter, but no significant changes resulted from this minor modification.
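A pandas sketch of this 36-to-9 compression, under the final grouping described above (children up to 17, adults 18 to 54, seniors 55 and over). The IB_* column naming follows the examples in the text, but the exact set of age bands in the extract is an assumption.

```python
import pandas as pd

AGE_BANDS = {
    "child":  ["0_2", "3_5", "6_10", "11_15", "16_17"],
    "adult":  ["18_24", "25_34", "35_44", "45_54"],
    "senior": ["55_64", "65_74", "75_plus"],
}
GENDERS = ["males", "females", "unknown"]

def aggregate_household_members(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the 36 IB_<gender>_<age band> indicators into 9 presence flags."""
    out = pd.DataFrame(index=df.index)
    for gender in GENDERS:
        for group, bands in AGE_BANDS.items():
            cols = [f"IB_{gender}_{b}" for b in bands if f"IB_{gender}_{b}" in df.columns]
            # Presence of at least one member of this gender and age group.
            out[f"presence_{gender}_{group}"] = (df[cols].fillna(0).sum(axis=1) > 0).astype(int)
    return out
```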
3.3 Data Issues
Prior to extensive predictive modeling efforts, the data was observed and examined to extract information
that might be useful in subsequent study. We observed several instances that suggested data inconsistencies
and possible biases in the data. Some of these issues relate to the means and methods of data collection and
interpretation, while others relate to possible artificial data structuring induced by aggregation of dissimilar
data from multiple sources. Presented below are a few cases in point.
3.3.1 Conflicts in Computation of Household_Layout variables
Household_Layout variables include certain variables that supply gender-wise and age-wise information on the presence of household members; examples include IB_males_0_2, IB_females_3_5, etc. (henceforth referred to as household-member variables). A positive response indicates the presence of a household member in that age and sex group. This information was not extracted from the consumer directly, but rather derived from third-party data sources, and its accuracy varied depending on the source of the data. The following situations led us to doubt the accuracy of some of the information contained in these variables:
• The data set also contains a variable called IB_house_size. The sum of positive responses across the variables describing the presence of household members should not exceed the value indicated by IB_house_size. Yet we observed no correlation between the aggregated value obtained from the household-member variables and the house size variable. We tried various combinations, including or excluding members of unknown gender, yet failed to achieve a match between the two sets of variables.
• Similarly, there was no reconciliation between the values represented by the variables IB_presence_of_child or IB_number_of_adults and the information aggregated using various combinations of household-member variables.
These data inconsistencies suggest that caution must be exercised in choosing variables for the modeling analysis, and that variable definitions and calculations must be analyzed carefully. To capture information on the age and gender of household members, we chose to use the aggregated household-member variables.
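A sketch of the kind of consistency check described above, assuming the household-member indicators and IB_house_size sit in the same DataFrame; the column names follow the text.

```python
import pandas as pd

def check_household_consistency(df: pd.DataFrame, member_cols: list) -> pd.Series:
    """Flag households whose summed member indicators exceed IB_house_size,
    and report the correlation discussed above."""
    member_total = df[member_cols].fillna(0).sum(axis=1)
    corr = member_total.corr(df["IB_house_size"])
    print("correlation with IB_house_size:", round(corr, 3))
    return member_total > df["IB_house_size"]
```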
3.3.2 Unbalanced Brand Representation
Consider the following distribution of consumer response for some of the brands in the Non–Survey data.
(Figure: # responses per brand for Dove Bar Soap, Dove Body Wash, Caress Bar Soap, Ragu Pasta Sauce, Lever 2000 Body Wash, Mealmk Stir Fry, and Suave Bar Soap; x-axis: # responses, 0 to 6,000.)
Figure 15: Consumer Response Distribution for Some of the Brands
Brands are represented disproportionately in the Unilever database. For instance, there is an overwhelming presence of Dove body products, whereas responses for some of the other products are relatively infrequent. The most frequently reported usage is of bar soap and body wash products, followed by food items, detergents, and other miscellaneous products, in that order. It seems evident that these patterns are not representative of true brand usage frequencies.
3.3.3 Insufficient Information on Source of Information in Non–Survey Data
The imbalance in reported usage across the various brands raised doubts about the origins of the Non-Survey data. Details on the source of the data and the methods of data collection were not disclosed clearly. We were advised to assume that the Non-Survey data represents information on the redemption of coupon promotions; however, we did not have details about the coupons themselves or their methods of circulation.
We note that the applicability of our data analysis results largely depends on the data quality and the extent to which it is understood. Though we believe that our results accurately reflect the process that generated the data, they are only as valuable in practice as the extent to which real brand usage behavior has been captured in the data presented to us.
3.4 Predictive Modeling
For the reasons listed above, our prediction efforts focused on predicting reported usage of individual brands. We also looked briefly at modeling of the MVC. The following summarizes our sequence of predictive analyses:
• Fitting various algorithms to the data set to identify a common predictive modeling methodology that leads to the best results over a wide range of brands.
• Examining the contribution of GQM and DM scores to the predictive models.
• Conducting "Most Valuable Consumer" (MVC) based analysis to capture the information contained in GQM and DM scores for prediction of MVC.
• Drawing inferences and conclusions from model results and suggesting means for obtaining improved results.
3.4.1 Choice of Models
We fitted a number of naïve and sophisticated models to arrive at a common predictive model that proved both accurate and interpretable for a wide range of brands. Some of the modeling techniques tried were k-Nearest Neighbors, Classification Trees, Artificial Neural Networks, and Logistic Regression. The choice of an appropriate model involves a tradeoff between the accuracy of results, interpretability, ease of application to an alternate data set, and computational complexity. The models rank as follows in decreasing order of interpretability: Regression Models, Classification Trees, k-Nearest Neighbors, and Neural Networks. Below is a brief description of these algorithms, focusing on the advantages and disadvantages of each.
Artificial Neural Network (ANN): An artificial neural network is a mathematical model capable of robust classification even when the underlying data structure is quite complex. ANNs derive their predictive power from an architecture of interconnected computational units, and they allow predictive modeling of more than one variable in a single iteration. Although capable of high accuracy, ANNs suffer from the drawback that their models can be hard to interpret; they are thus more appropriate when predictive accuracy is more important than interpretability. They also require very large training data sets, are susceptible to over-training, and are more computationally intensive than other algorithms.
Classification Trees: A classification tree makes classifications based on a set of simple rules that can be organized in the form of a decision tree. While these models are not as intricate as ANNs, they are considerably more interpretable, and the decision rules can be easily translated into business strategies; they have therefore become increasingly popular in business contexts. This technique is also applicable to data sets with missing values, which considerably reduces the data cleaning effort.
k-Nearest Neighbors: This methodology is in many ways similar to the collaborative filtering methodology discussed earlier. Usage predictions for a given variable are computed as a weighted average of the usage of other consumers. Its main advantage lies in the fact that it is effective when there are many response variables. However, its simplicity may come at the expense of predictive accuracy.
Logistic Regression: Logistic regression was discussed earlier; it has long been used to model preferences in marketing contexts. Its benefits include a record of success over a wide range of prediction problems, interpretability of the models, and the speed of the algorithms used for model fitting. The output of a logistic regression model is a set of posterior probabilities for which we can vary the threshold according to the desired level of predictive aggressiveness.
Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network
After fitting all of the aforementioned algorithms to numerous brands and comparing the results, logistic regression was chosen as the common model for all subsequent predictive analysis. We observed that predictive results from logistic regression were at least as good as or better than the results obtained using other models for a wide range of products. Figure 16 illustrates the superior predictive performance of logistic regression compared to classification trees and k-nearest neighbors (which appears as "User" in the figure) in the case of Dove Body Wash.
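A sketch of this kind of model comparison using scikit-learn, with X/y matrices of demographic predictors and a binary brand usage target; the particular classifiers, their settings, and the use of validation AUC as a stand-in for lift are illustrative choices, not necessarily the tools used in the original study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def compare_models(X_train, y_train, X_val, y_val):
    """Fit several candidate classifiers and compare their validation AUC."""
    candidates = {
        "logistic": LogisticRegression(max_iter=1000),
        "tree": DecisionTreeClassifier(max_depth=5),
        "knn": KNeighborsClassifier(n_neighbors=25),
        "neural_net": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return scores
```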
3.4.2 Predictive Efforts using Logistic Regression
An exhaustive logistic regression analysis was conducted for all the seventy brands in the data set and a
number of interesting results were observed.
Varying Success in Predictive Efforts
Logistic regression models were built to predict each brand's usage based on the demographics and model scores only. We observed varying degrees of success across brands, ranging from highly successful results, as for Breyer's Ice Cream, to poor results, as for Lipton Tea Bags. Below we present lift curves depicting typically good, moderate, and poor results.
(Figure: three lift curves, for Breyers Ice Cream (good), Caress Body Wash (medium), and Lipton Tea Bags (poor).)
Figure 17: Lift Curves for Different Unilever Brands
Some of the conclusions to be made are as follows:
a) For a majority of brands, noticeable lift was observed. Typical lift curves over a wide range of products are similar to the Caress Body Wash lift curve shown above. This indicates that the demographic and model score variables contain considerable predictive power in this data set and can be used for making brand usage predictions.
b) The most important predictive variables emerged to be the Golden Question Model (GQM) and Demographic Model (DM) scores. There are two possible explanations for this. The first is that MVC may be an important summary statistic that captures brand usage. The second is that the Golden Question Model takes a number of brand usage variables as inputs; thus using the GQM scores to further predict the same usage variables can lead to artificially inflated results that may be misleading.
c) The following demographics are seen to have a significant presence in models for a wide range of brands:
Significant demographics: Age, Gender, Household Members, Length of Residence, Marital Status, Region Code
Figure 18: Significant Demographic Parameters
d) Brands with high coefficients in the GQM computation had significantly better lift curves. As already noted, this may be deceptive; hence the lift charts for the brands that were used as inputs in the computation of the GQM must be viewed with caution. Breyers Ice Cream is an example of such a brand.
e) Brands for which the response rate was below 10% of the total consumer base led to poor predictive results, as in the case of Lipton Tea Bags. This can be attributed to an insufficient number of data entries available for training and validating the predictive models.
Example of a Predictive Model
The following chart is an example of a logistic regression predictive model for one of the Unilever brands, namely Gorton Fillets. The diagram graphically indicates the t-scores for the model.
Figure 19: Graphical Representation of Logistic Regression for Gorton Fillets
The most important model coefficients represented above are as follows:
Target: Breyers Ice Cream (logistic regression model)
  Model Score GQM             4.7005
  Model Score DM             -2.521
  Intercept                  -3.0433
  Absence of Female Adult     0.2977
  Gender                     -0.358
  Age                         0.0183
  Absence of Unknown Adult    0.400
  Home Renter                 0.215
Figure 20: Logistic Regression Model Coefficients for Gorton Fillets
During the course of the study, it became evident that the GQM and DM scores held significant predictive information; for a majority of products, the model coefficients were highest for these score variables. As explained earlier, the GQM and DM scores were fit using panel data, so their importance in models for the same brands on a different data set suggests a degree of similarity between the panel data and the Non-Survey data, and establishes the importance of GQM and DM scores in predictive efforts.
Modeling With and Without GQM Scores
Further analysis was carried out to judge the contribution of the GQM and DM score variables in predictive models. We wished to explore the comparative performance of models based on demographics only, without the GQM and DM scores. The figure below indicates the superior performance of the models based on both demographics and model scores compared to models based on demographics only. (In Figure 21, "Reg" represents the model excluding GQM scores and "Reg 2" represents the model based on demographics only, for the prediction of Breyer's Ice Cream usage.)
Figure 21: Comparative Lift Curves for Models With and Without GQM Scores
3.5 GQM Score-Based Stratified Analysis
As documented previously, Unilever spent considerable effort in identifying the "Most Valuable Consumers" (MVCs) on a brand- and company-wide level. Given the Golden Question Model scores and Demographic Model scores for each consumer, we were keen to enhance our predictive efforts using this information. The exhaustive logistic regression analysis had already convinced us that GQM and DM scores could be highly instrumental in predicting brand usage. A two-pronged approach was followed in the MVC-based analysis:
3.5.1 Stratified Prediction Models
Previous modeling attempts focused on fitting a single model for each brand to the entire data set and generating posterior probabilities using these models. Since GQM scores seemed to contain information on MVC, we stratified the data set into three categories based on the model scores. First, a separate training data set was created for each stratum; the partition was based on the consumer distribution such that each stratum had roughly one-third of the consumers. All consumers with a GQM score greater than 0.679 were categorized as high GQM consumers, all consumers with a GQM score between 0.342 and 0.679 were categorized as medium GQM consumers, and the rest were categorized as low GQM consumers. Separate logistic regression models were built for each of these three training data sets. Based on the strata cutoffs for the training set, the validation data set was also divided into three corresponding data sets. Models fit on the respective training data sets were applied to the validation sets to compute posterior probabilities for each consumer in all the categories. Thus we generated two sets of posterior probabilities for each consumer, one from the overall model and one from the stratified analysis. Lift charts were drawn for both predictive efforts to compare performance.
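A sketch of this stratified scheme using scikit-learn and the GQM cutoffs quoted above (0.342 and 0.679); the DataFrame layout and column names are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

CUTOFFS = (0.342, 0.679)   # strata boundaries quoted in the text

def gqm_stratum(score: pd.Series) -> pd.Series:
    """Assign each consumer to the low / medium / high GQM stratum."""
    bins = [-float("inf"), CUTOFFS[0], CUTOFFS[1], float("inf")]
    return pd.cut(score, bins=bins, labels=["low", "medium", "high"])

def fit_stratified_models(train: pd.DataFrame, predictors: list, target: str) -> dict:
    """Fit one logistic regression per GQM stratum on the training data."""
    strata = gqm_stratum(train["gqm_score"])
    models = {}
    for level in ["low", "medium", "high"]:
        subset = train[strata == level]
        models[level] = LogisticRegression(max_iter=1000).fit(subset[predictors], subset[target])
    return models

def stratified_predictions(models: dict, data: pd.DataFrame, predictors: list) -> pd.Series:
    """Score each consumer with the model of his or her stratum."""
    strata = gqm_stratum(data["gqm_score"])
    preds = pd.Series(index=data.index, dtype=float)
    for level, model in models.items():
        mask = strata == level
        if mask.any():
            preds[mask] = model.predict_proba(data.loc[mask, predictors])[:, 1]
    return preds
```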
(Figure: lift curves for Dove Bar Soap comparing the GQM-stratified analysis, the overall analysis, and the random baseline; the two modeling approaches give very similar curves.)
Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models
Figure 22 shows lift charts for the GQM score–based stratified and overall models for Dove bar soap, which
are representative of other brands as well. It is clear that both methodologies lead to similar results. We
noted that there was little difference among the three separate models created for the stratified data sets
and that each of these models was also close to the overall model. This claim was further substantiated by
the details of the model coefficients in each case. We observed that the important variables in the overall
model and the stratified models were the same for a given brand, with only slight variation in the coefficients
for each of these variables. We therefore concluded that stratification of the data set according to GQM
scores does not lead to improved results.
3.5.2 Predicting GQM Strata for Each Consumer
We have discussed the reasons that prevented us from thoroughly investigating the concept of MVC given
the nature and type of data available to us: the Unilever data set contains no direct indication of MVC status,
only GQM- and DM-based estimates of it. Nevertheless, we attempted to use GQM scores as a target
for our predictive models. Instead of predicting exact GQM scores, we tried to predict from demographics
whether a consumer belonged to the high, medium, or low GQM stratum. Our intention was to generate
a representative MVC model using Non–Survey data based on demographics only. Logistic regression was
used to arrive at results. The GQM strata were defined as before. Each consumer was assigned a stratum
label of one, two, or three according to whether he or she belonged to the high, medium, or low GQM
stratum. Predictive models were then fit to predict the stratum for each consumer based on demographic
variables only.
We achieved very little success in predicting the GQM score stratum to which each consumer belonged.
For this reason, and because the applicability of such a model was unclear, we did not find it appropriate to
pursue this line of analysis further.
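For completeness, the sketch below shows one simple way to set up this stratum-prediction task as a three-class logistic regression on demographics, again reusing the hypothetical df, demo_cols, and gqm_stratum from the earlier sketches; the report specifies only that logistic regression was used, so the exact setup here is an assumption.

```python
# Sketch: predicting a consumer's GQM stratum (high/medium/low) from demographics only.
# Reuses the hypothetical df, demo_cols, and gqm_stratum() from the earlier sketches.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

labels = df["gqm_score"].map(gqm_stratum)
X_train, X_valid, y_train, y_valid = train_test_split(
    df[demo_cols], labels, test_size=0.3, random_state=0)

# Three-class logistic regression (multinomial by default with the lbfgs solver).
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_valid)
print("accuracy:", accuracy_score(y_valid, pred))
print(confusion_matrix(y_valid, pred, labels=["high", "medium", "low"]))
```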
4 CLUSTERING ANALYSIS
4.1 Clustering Background
Clustering places objects into groups, or clusters, suggested by the data. The objects in each cluster tend to
be similar to each other in some sense, and objects in different clusters tend to be dissimilar. The
observations are divided into clusters so that every observation belongs to at most one cluster. Clustering
not only reveals inherent data characteristics by identifying points of similarity or dissimilarity in the data
set, it also aids in understanding data structure issues. If dissimilar data sets are aggregated to produce a
bigger data set, clustering of the aggregated set might reveal the underlying data sets. An added advantage
of clustering analysis is that it can be applied to a data set with missing values as well.
Aside from data cleaning and data structure issues, clustering results can also be of interest in their own
right, identifying groups of consumers with similar traits who may be targeted in a similar fashion.
Additionally, we explored the possibility of improved prediction by modeling the individual clusters
separately and then aggregating the results.
Clustering can be performed using various methods. For our analysis we chose Ward’s method,
which is somewhat more sophisticated than the popular but simple k–means method. Ward’s method is an
iterative method that seeks to minimize the statistical spread of observations within a cluster. In this
method, the distance between two clusters is the ANOVA sum of squares between the two clusters, summed
over all the variables. At each iteration, the within–cluster sum of squares is minimized over all
partitions obtainable by merging two clusters from the previous iteration. Clustering can be performed
using various measures of spread or distance among the data points; in this study we used the least-squares
criterion because it is fastest on large data sets.
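The study used SAS Enterprise Miner; as an open-source illustration of the same idea, the sketch below applies Ward-linkage hierarchical clustering to a synthetic consumers-by-brands 0/1 matrix and cuts the resulting tree into three clusters. The brand_df data frame and its column names are hypothetical stand-ins for the seventy brand-usage indicators.

```python
# Sketch: Ward's hierarchical clustering on synthetic brand-usage indicators.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
brand_df = pd.DataFrame(rng.integers(0, 2, size=(500, 10)),
                        columns=[f"brand_{i}" for i in range(10)])

# Ward linkage merges, at each step, the pair of clusters whose merger gives the
# smallest increase in total within-cluster sum of squares (Euclidean distances).
Z = linkage(brand_df.to_numpy(), method="ward")

# Cut the tree into three clusters, mirroring the three groups found in the study.
clusters = fcluster(Z, t=3, criterion="maxclust")
print(pd.Series(clusters).value_counts(normalize=True))
```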
During the clustering analysis, SAS Enterprise Miner computes an Importance Value between 0 and 1 for
each variable in the data set, representing the worth of that variable in the formation of clusters. As the data
is split into clusters, the Importance Value of each variable indicates the extent to which it influenced the
splitting process. An importance of 0 indicates that the variable was not used as a splitting criterion, while
an importance of 1 indicates that the variable had the highest worth among the splitting criteria.
One of the most important tools that helps us interpret individual clusters is the Input Mean Chart for each
cluster. This allows a comparison of the variable means for selected clusters to the overall variable means.
The input means are normalized using the scale transformation
$$y = \frac{x - \min(x)}{\max(x) - \min(x)}.$$
For example, assume five input variables $Y_1, \ldots, Y_5$ and three clusters $C_1, C_2, C_3$. Let the
input mean of variable $Y_i$ in cluster $C_j$ be $M_{ij}$. Then the normalized mean, or input mean,
$SM_{ij}$ is
$$SM_{ij} = \frac{M_{ij} - \min(M_{i1}, M_{i2}, M_{i3})}{\max(M_{i1}, M_{i2}, M_{i3}) - \min(M_{i1}, M_{i2}, M_{i3})}.$$
The input means are therefore normalized to fall in the range from 0 to 1. For each cluster, the input means
are ranked by the magnitude of the difference between the input means for the selected cluster(s) and the
overall input means. The variables with the highest spreads typically best characterize the selected
cluster(s). Input means that are very close to the overall means are not very helpful in describing the unique
attributes of consumers within the selected cluster(s).
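A short sketch of this normalization, reusing the hypothetical brand_df and clusters from the Ward sketch above: cluster-level means are rescaled per variable to the [0, 1] range, and variables are then ranked by how far a cluster's mean sits from the overall mean, as in the Input Mean Chart.

```python
# Sketch: normalized input means per cluster, following the formula above.
# Reuses brand_df and clusters from the previous sketch.
import pandas as pd

means = brand_df.groupby(clusters).mean()              # M_ij: mean of variable j in cluster i
col_min, col_max = means.min(axis=0), means.max(axis=0)
norm_means = (means - col_min) / (col_max - col_min)   # SM_ij, scaled to [0, 1] per variable

# Rank variables for cluster 1 by the distance of its input mean from the overall mean.
overall = brand_df.mean()
spread = (means.loc[1] - overall).abs().sort_values(ascending=False)
print(norm_means.round(2))
print(spread.head())
```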
4.2 Cluster Analysis Details
We performed a clustering analysis to identify distinctive groups of consumers based on their brand usage
responses only. All seventy brands were used as input variables for the clustering analysis. The analysis led
to a few noteworthy insights and enhanced our understanding of the data. The analysis was performed over
the entire Non–Survey data set. We also repeated the clustering several times on various randomly sampled
subsets of the Non–Survey data to ensure the generality and accuracy of the results. The repeated runs of the
clustering algorithm over these subsets gave similar results and consistently indicated an obvious clustering
into three groups. The results shown below are for the clustering of the entire Non–Survey data set.
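Before turning to the figures, the stability check described above can be sketched as follows, again reusing the hypothetical brand_df: the Ward clustering is re-run on random half-samples and the resulting cluster proportions are compared across runs.

```python
# Sketch: checking stability of the three-cluster solution on random subsets.
# Reuses brand_df from the Ward clustering sketch above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
for run in range(3):
    idx = rng.choice(len(brand_df), size=len(brand_df) // 2, replace=False)
    sub = brand_df.iloc[idx]
    labels = fcluster(linkage(sub.to_numpy(), method="ward"), t=3, criterion="maxclust")
    sizes = np.bincount(labels)[1:] / len(sub)   # proportion of points in clusters 1-3
    print(f"run {run}: cluster proportions {np.round(sizes, 2)}")
```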
Figure 23: Cluster Pie Chart
Figure 23 depicts the three clusters as distinct pie sections whose sizes indicate the frequency of data points
in each cluster. The color of each pie section reflects the root-mean-square standard deviation of the points
in that cluster; these values are provided in Figure 24.
Cluster # | % Data Points | Cluster Std Deviation | Cluster Description
1         | 38            | 0.246                 | Frequent buyers of body soaps/washes; infrequent buyers of food items
2         | 8.7           | 0.269                 | Buyers of items with low response rates; frequent buyers of detergents
3         | 53.3          | 0.194                 | Frequent buyers of food items; infrequent buyers of soaps/washes
Figure 24: Cluster Statistics
Figure 24 shows some statistical details of each cluster. These include:
• Percentage Data Points – percentage of data points in each cluster (total number of data points: 8,608).
• Cluster Standard Deviation – average standard deviation of the points in each cluster from the cluster mean, indicating within-cluster spread of the data.
• Cluster Description – a qualitative description of the distinguishing features of each cluster.
The cluster descriptions are based on the input means plots and importance values described previously.
Another tool used in the process was the decision tree created during the clustering analysis, which is
similar to a classification tree: it generates a set of simple variable-based rules that approximate the cluster
boundaries. The following sections describe the crucial features that distinguish the clusters and discuss the
traits shared within each cluster.
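A rough analogue of that rule-generating tree is sketched below, fit to the cluster labels from the Ward sketch above; sklearn's decision tree and its feature_importances_ are only an approximation of the SAS Enterprise Miner output, not a reproduction of it.

```python
# Sketch: a shallow classification tree fit to the cluster labels, yielding simple
# variable-based rules that approximate the cluster boundaries, plus a rough
# per-variable importance measure. Reuses brand_df and clusters from above.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(brand_df, clusters)

# Human-readable splitting rules approximating the cluster boundaries.
print(export_text(tree, feature_names=list(brand_df.columns)))

# A rough analogue of the importance values reported by SAS Enterprise Miner.
for name, imp in sorted(zip(brand_df.columns, tree.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```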
4.3 Inference from Cluster Analysis
4.3.1 Effects of Unbalanced Brand Representation
Figure 25 depicts the importance values of the variables most critical to the clustering of the data points.
[Chart: importance values, on a 0 to 1 scale, of the most significant brands in the clustering; personal wash brands (Dove body wash, Caress bar soap, Oofo body wash, Pond’s face, Dial bar soap, Oofo bar soap, Caress body wash) dominate, followed by All laundry powder, Surf heavy duty liquid, Ragu pasta sauce, GHB ice cream, and Suave fabric conditioner]
Figure 25: Importance Value of Significant Variables in Clustering
We find that the personal wash products are the most important variables in the clustering analysis. We
also noticed that the frequency of brand usage affects the importance value, with brands with low reported
usage typically having low importance values. Thus the importance values are related to the unbalanced
brand representation in the data set.
4.3.2 Details of Cluster 1
Cluster 1 consists of 38% of the consumers (3,271). Figure 26 presents the means of particular brands
within cluster 1 compared to the overall means of the same brands over the entire data set. Note that the
complementary brand usage response has been modeled, so bars to the left of the overall mean actually
indicate that the cluster has a higher incidence of positive response for that product. For instance, the chart
shows a value of 55% for Dove body wash in the overall data set but only 32% within cluster 1; since the
complement of usage is plotted, this means that the proportion of Dove body wash buyers in cluster 1 is
significantly higher than in the data set as a whole.
Figure 26: Input Means for Cluster 1
We conclude that consumers in cluster 1 display a significantly higher propensity to purchase body
hygiene products, including bar soaps and body washes. The main products purchased by these
individuals, in order of significance, are Dove body wash, Dial bar soap, Caress bar soap, Oil of Olay
body wash, etc. They are also notably infrequent buyers of food items such as Ragu pasta sauce, Wishbone
salad dressing, and Gorton fillets.
4.3.3 Details of Cluster 2
Cluster 2 consists of only 8.7% of the consumers (748). These are individuals who are infrequent
buyers of both personal wash and food items. They appear to purchase items with very low response
rates, such as Ponds Face, VICL Body, VSLN Body, and some of the detergents. It is possible that these
responses were collected from a data set of beauty and laundry product purchasers, or that these consumers
are simply outliers in the data set.
Figure 27: Input Means for Cluster 2
4.3.4 Details of Cluster 3
Cluster 3 consists of 53% of the consumers (4,588). The most important attribute of individuals in this
cluster is that they are considerably more frequent buyers of food items than the average consumer.
Important food items include Ragu pasta sauce, Wishbone salad dressing, Breyers ice cream, and Gorton
fillets. These consumers are also characterized as infrequent buyers of personal hygiene products; for
example, they are infrequent buyers of Caress bar soap, Dove body wash, Dial bar soap, Lever 2000 bar soap, etc.
Figure 28: Input Means for Cluster 3
Based on the consumer attributes revealed by the clustering analysis, we conclude that the clusters found
may be approximating the underlying data sets that were aggregated. We found it intriguing that the cluster
of individuals with a higher propensity for purchasing body hygiene products should have a significantly
lower tendency to purchase food products, and that the cluster of consumers with a high propensity for
purchasing food items should have a lower propensity for purchasing body hygiene products. Also, the few
outlying individuals who are frequent purchasers of neither personal wash nor food items have been
clustered separately. Possibly, information on body hygiene brands like Dove and on food items like Ragu
pasta sauce was collected separately and aggregated into one data set, which would explain the clusters that
we observe. As mentioned several times before, body soap and wash products overwhelm the usage data;
the clustering analysis therefore also segregates frequent and infrequent buyers of these items into separate
clusters. Given the likelihood that the data available to us has been aggregated from various sources, and
that differences in source may be guiding the clustering analysis, extrapolation of the results to true usage
behavior may be inappropriate. That is, the clusters we have generated may accurately reflect groups in the
data without being indicative of true consumer usage patterns.
4.5 Cluster–Wise Predictive Modeling
Regardless of the reason the data clusters emerge, they may potentially lead to improved prediction
results. Here we investigate whether fitting a separate model for each cluster leads to better results. If
accuracy is improved, it suggests that separate clusters of consumers should preferably be modeled
individually.
Once we identified the possibility that prior data aggregation was creating artificial structure,
we explored a cluster–wise predictive methodology to obtain better results. The objective was to determine
whether agglomeration of data sets was desirable or whether data for separate brands and product categories
should be treated separately.
For this purpose, each data cluster was treated as a separate data set and a logistic regression model was
built for each. Predictive probabilities were thus generated for each data point according to the cluster it
belonged to. Probabilities for the same data points were also generated from an overall model fit
to the entire agglomerated data set. Lift curves were generated for both analyses and compared.
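The cluster-wise procedure can be sketched as follows, reusing the hypothetical df, demo_cols, score_cols, and lift_curve from the earlier sketches; the cluster column here is a random placeholder standing in for the labels produced by the Ward clustering of Section 4.

```python
# Sketch of cluster-wise versus overall prediction; reuses df, demo_cols, score_cols,
# and lift_curve() from the earlier sketches. The `cluster` column is a random
# placeholder for the labels produced by the actual clustering of the usage data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
df = df.assign(cluster=rng.integers(1, 4, len(df)))
feat_cols = demo_cols + score_cols

train, valid = train_test_split(df, test_size=0.3, random_state=0)

# One model per cluster, applied to the matching validation cluster.
cluster_probs = pd.Series(index=valid.index, dtype=float)
for c, part in train.groupby("cluster"):
    model = LogisticRegression(max_iter=1000).fit(part[feat_cols], part["usage"])
    mask = valid["cluster"] == c
    cluster_probs[mask] = model.predict_proba(valid.loc[mask, feat_cols])[:, 1]

# A single overall model on the agglomerated training data.
overall = LogisticRegression(max_iter=1000).fit(train[feat_cols], train["usage"])
overall_probs = overall.predict_proba(valid[feat_cols])[:, 1]

# With the study's real cluster labels the cluster-wise lift was noticeably better
# (Figure 29); with this random placeholder the two curves will roughly coincide.
print("cluster-wise:", lift_curve(valid["usage"], cluster_probs.to_numpy()))
print("overall:     ", lift_curve(valid["usage"], overall_probs))
```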
Figure 29 shows comparative lift charts for the cluster–wise and overall analyses for Dove bar soap, which
are typical of the lift curves observed for other brands too. This analysis shows that a cluster-wise predictive
model can give significantly better results than a predictive model based on the entire agglomerated data
set. This observation is in contrast to the weak results we obtained by separately modeling consumers in
different GQM score strata.
[Figure: lift curves comparing the cluster-wise analysis, the overall analysis, and the baseline]
Figure 29: Comparative Lift Curves for Cluster – Based and Overall Models
The conclusion to be drawn from this analysis is that modeling small homogeneous groups in the data
independently is a more useful exercise than fitting an overall model to an agglomerated data set. If our
clusters reflect underlying data sets from different product categories that have been agglomerated to
form the Unilever Database, then it is more advantageous for Unilever to analyze its various data sets
independently.
5 PROJECT SUMMARY AND CONCLUSIONS
We conclude the report with a brief summary of what has been accomplished. Below we enumerate
some overall conclusions and recommendations based on all our analyses.
A single extract of the Unilever database was made available to us for the purpose of developing modeling
methodologies useful for Unilever’s business, to identify actionable insights from the data, and to evaluate
the data as an asset for targeted marketing. In addition to sizeable efforts trying to understand, clean, and
prepare the data, we focused on generating predictive models of individual brand usage because such
models are arguably the most valuable tool for targeted marketing. Our efforts and interactions with
Unilever representatives gave us a better understanding of the data which in turn led to more refined
analyses on subsets of the data. Finally, we generated clustering models to identify groupings in the data,
whether due to natural brand usage patterns or induced artificially through combination of various data sets.
We list a number of overall trends and conclusions that arise out of our analyses:
• There is a need to understand the content of the Unilever database more fully. This includes
gaining a greater understanding of the effect of missing data and assessing the quality of some
of the third-party information. Also, the Unilever database includes some information on dates
and usage quantities. The availability of such data could potentially lead to more interesting
analyses.
• Logistic regression appears to be a suitable method for most of the prediction tasks undertaken. It
has the benefits of being well studied for these types of data and of being interpretable. Its
performance was comparable to that of more complicated methods for most of the tasks we tried.
Overall, we observed that data quality issues seemed more important than the choice of modeling
methodology.
• The Unilever data seems to be aggregated from several different data sources, many of which
seem to be incompletely understood and/or poorly documented. We note that the quality and
applicability of any data analysis depends critically on the quality of the underlying data and the
extent to which it is understood.
• Our prediction efforts seem to indicate that, in general, predictions of self-reported brand usage
based on MVC estimations and on usage of other products were somewhat effective. We believe
that in this data set there is a potential problem with using MVC estimates to predict usage of
certain individual brands, as these MVC estimates are sometimes based on the same data we are
trying to predict. Demographic variables seemed to have less predictive power.
• Many of our prediction and clustering analyses seemed to be heavily influenced by data
aggregation effects. In many cases, these effects may have dominated any underlying consumer
usage effects. As a result, our results may be representative of the data we are working with but
extrapolate poorly to actual usage situations.
• A strategy of clustering consumers first and then fitting prediction models gave improved prediction
results. This suggests that the data may best be analyzed in a cluster-wise manner. If the clusters are
representative of underlying data sources, then these data sources may be better analyzed
individually rather than in an aggregated fashion.
• While the Unilever database provides information on a large number of consumers, it is our
opinion that modeling and insight generation about consumer behavior is best performed on a
cleaner data set that is better understood and more uniformly gathered. Examples include the
panel data to which Unilever already has access and data gathered by retailers of Unilever brands.