Clementine® Application Template
for Analytical CRM in
Telecommunications 7.0
For more information about SPSS® software products, please visit our Web site at
http://www.spss.com or contact
Marketing Department
SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6307
Tel: (312) 651-3000
Fax: (312) 651-3668
SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for
its proprietary computer software. No material describing such software may be produced or
distributed without the written permission of the owners of the trademark and license rights in the
software and the copyrights in the published materials.
The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use,
duplication, or disclosure by the Government is subject to restrictions as set forth in
subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at
52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor,
Chicago, IL 60606-6307.
General notice: Other product names mentioned herein are used for identification purposes only
and may be trademarks of their respective companies.
This product includes software developed by the Apache Software Foundation
(http://www.apache.org).
Windows is a registered trademark of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.
DataDirect, INTERSOLV, SequeLink, and DataDirect Connect are registered trademarks of
MERANT Solutions Inc.
Clementine® Application Template for Analytical CRM in Telecommunications 7.0
Copyright © 2002 by Integral Solutions Limited
All rights reserved.
Printed in the United States of America.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the
prior written permission of the publisher.
Contents

1  Introduction to Clementine Application Templates  6
   What Is a Clementine Application Template? . . . 6

2  Introduction to the Telco CAT  8
   Overview . . . 8
   Introduction to Analytical CRM . . . 9
   Life Cycle Model for CRM . . . 9
   Why Analytical CRM? . . . 10

3  Getting Started  11
   Overview . . . 11
   CAT Data . . . 11
   CAT Streams . . . 12
   CAT Modules . . . 12
   How to Use the CAT Streams . . . 14
   Notes on Reusing CAT Streams . . . 15

4  Working with CAT Streams  16
   Overview . . . 16
   Module 1--Churn Application . . . 16
      P1_aggregate.str--Aggregate Call Data and Merge with Customer Record . . . 19
      P2_value.str--Customer Value and Tariff Appropriateness . . . 21
      E1_explore.str--Visualize Customer Information and Value . . . 23
      E2_ratios.str--Visualize Derived Usage Category Information . . . 26
      P3_split.str--Derive Usage Category Information and Train/test Split . . . 29
      M1_churnclust.str--Customer Clustering and Value/churn Analysis . . . 31
      M2_churnpredict.str--Model Propensity to Churn . . . 34
      D1_churnscore.str--Score Propensity to Churn . . . 36
   Module 2--Cross-Sell Streams . . . 37
      P4_basket.str--Produce Customer Product Basket Records . . . 39
      P5_custbasket.str--Merge Customer, Usage, and Basket Data . . . 40
      E3_products.str--Product Association Discovery . . . 41
      M3_prodassoc.str--Customer Clustering and Product Analysis . . . 43
      E4_prodvalue.str--Product Groupings Based on Customer Value . . . 45
      M4_prodprofile.str--Propensity to Buy Grouped Products . . . 47
      D2_recommend.str--Product Recommendations from Association Rules . . . 48

Appendix A  Telco CAT Data Files and Field Names  51
   Raw Data Files for Modules 1 and 2 . . . 51
   Intermediate Data Files for Module 1 . . . 53
   Intermediate Data Files for Module 2 . . . 57

Appendix B  Using the Data Mapping Tool  60
   Mapping New Data to a CAT Stream . . . 60
   Mapping Data to a Template . . . 61
   Mapping between Streams . . . 63
   Specifying Essential Fields . . . 64
   Examining Mapped Fields . . . 65

Index  67
Chapter 1
Introduction to Clementine Application Templates
What Is a Clementine Application Template?
A Clementine Application Template (CAT) is a collection of materials for use with
the Clementine data mining system that illustrates the techniques and processes of
data mining for a specific application. These materials are centered around a library
of Clementine streams (or stream diagrams) that illustrate the data mining techniques
commonly used in the selected application. The purpose of these streams is for you to
study and reuse them in order to simplify the process of data mining for your own
similar business application.
Reusing an application template is simply a matter of fit. Although every data
mining application is different, there are many techniques commonly used throughout
data mining (such as propensity modeling, clustering, and profiling). This means that
certain data mining processes, such as clustering, will apply to almost all data mining
projects. However, if your industry is similar to the one used for a particular CAT, you
will likely be able to use even more of the illustrated techniques. For example, when
a Clementine stream is constructed to perform a particular task in an application, there
is often enough structural similarity to allow reuse of the stream in a similar
application. Reusing the stream can save you significant amounts of time and effort.
Clementine application templates are designed to make use of this similarity
between related data mining projects by providing sample projects to guide you.
Within a particular application type (such as customer relationship management, or
CRM, for the telecommunications industry), there are many standard tasks and
analyses that you can reuse if applicable to your data mining project.
A Clementine application template consists of:
- A library of Clementine streams.
- Synthetic data that allow the streams to be executed for illustrative purposes without modification. CAT data are supplied in flat files to avoid dependence on a database system. The data used in the CATs may be classified into two types: raw and intermediate. Raw data files are the starting point of each CAT. Intermediate files can be generated by the preprocessing streams supplied.
- A user’s guide that explains the application, the approach and structure used in the stream library, the purpose and use of each stream, and how to apply the streams to new data.
Chapter 2
Introduction to the Telco CAT
Overview
The Telco CAT is a Clementine application template for analytical customer
relationship management (CRM) in the telecommunications industry. It illustrates the
data mining techniques applicable to churn management and cross-selling described
below:
- Preprocessing. This phase of analytical CRM, or data mining, handles the merging and aggregation of customer and call data, the derivation of customer value and tariff-related fields, and the preprocessing steps for producing "basket-style" customer product data.
- Exploration. This phase uses a wide range of exploratory techniques, such as histograms and distribution charts, to understand the overall properties of the data, including the factors that influence customer churn and product purchase.
- Modeling and analysis. This phase illustrates the use of clustering and profiling to understand customer churn and assist targeted cross-selling. Additionally, predictive techniques are used in Module 1 to predict the occurrence of churn, and association discovery is used in Module 2 for cross-selling.
- Deployment. The final phase illustrates the use of the Clementine Solution Publisher to deploy churn prediction and cross-sell recommendation techniques.
The techniques used in these data mining phases help to answer the business questions
typically encountered in the telecommunications industry. The Telco CAT will help
you to see how this is accomplished.
Introduction to Analytical CRM
In the modern business world, customer focus is increasingly important. The unit of
business activity has become the customer relationship as a whole, rather than the
individual sale. The concept of customer relationship management arises from the fact
that to be successful, businesses must manage not only the processes of production and
distribution but also the customer relationships themselves.
Customer relationship management (CRM) has two components--an operational
component and an analytical one.
- Operational CRM ensures that the operational aspects of the business treat the customer relationship as a unit--for example, by making all of the information about interactions with a particular customer available at every customer touchpoint.
- Analytical CRM provides a greater understanding of customers, both individually and as a group, allowing the business to meet the needs of the customer at all levels, from individual transactions to overall strategy.
Data mining is at the heart of analytical CRM because it is used to uncover the hidden
meaning in customer interactions, allowing businesses to understand their customers
and predict what they will do.
Life Cycle Model for CRM
To understand how CRM benefits a business, take a closer look at the nature of the
customer relationship. A customer relationship is like a story with a beginning, a
middle, and an end. At each point in the story, or "customer life cycle," CRM can focus
on a particular goal:
- At the beginning, to attract more and more-profitable customers.
- In the middle, to maximize the value of each customer to the business.
- At the end, to delay or reduce the loss of valuable customers.
These goals are summarized in the following graph:
Figure 2-1
Customer Life Cycle, Value, and CRM
Why Analytical CRM?
Data mining for CRM, also known as analytical CRM, can assist a business in
achieving all of the benefits discussed in this guide--better customer acquisition, better
cross-selling, and better customer retention. Specifically, analytical CRM can help
your business in the following ways:
- Improved customer acquisition through an increased understanding of customer segments and value. Specific segments, sometimes of a particularly high value, can be targeted in campaigns.
- Improved cross-selling through a better understanding of customer segments and their relationship to product purchase. Knowledge of customers allows you to understand what they are likely to buy.
- Improved customer retention by understanding when and why customers are likely to leave the customer base, enabling you to take remedial action where appropriate.
The Telco CAT illustrates analytical CRM to achieve all of these benefits through
specific applications of data mining.
Chapter 3
Getting Started
Overview
The Telco CAT is structured into two modules or "virtual applications" that explore
Clementine operations typical to the telecommunications industry.
- Module 1 is a churn application designed to increase customer retention.
- Module 2 is a cross-sell application that processes product information and merges it with customer information from Module 1 for more targeted cross-selling.
Each application consists of a number of streams that work either from raw data files
or from intermediate files produced by the preprocessing streams.
CAT Data
The data provided with the CAT are based on a fictitious telecommunications
company; the data is entirely synthetic and bears no relation to any real company.
The raw data files are:

custinfo.dat    Basic customer information
cdr.dat         Call data aggregated by month
tariff.dat      Details of the tariff scheme in use
products.dat    Table of products or services purchased by each customer
The Telco CAT also contains six intermediate data files produced by stream operations.
In several cases, these intermediate data files are then used in other stream operations.
CAT Streams
The two modules of the Telco CAT consist of 15 streams. The streams are organized
according to the Cross-Industry Standard Process for Data Mining (CRISP-DM)
methodology and contain a prefix indicating the appropriate data mining phase. For
example, P1_aggregate.str is the first stream used in the preprocessing phase to
aggregate data. As illustrated in the table below, the prefix codes used for Telco CAT
streams are: P - preprocessing, E - exploration and understanding, M - modeling, and
D - deployment.
CRISP-DM Phase          Prefix Code   Module 1 Streams   Module 2 Streams   Total Streams
Data Preparation        P             3                  2                   5
Data Understanding      E             2                  2                   4
Modeling & Evaluation   M             2                  2                   4
Deployment              D             1                  1                   2
Totals                                8                  7                  15
CAT Modules
The Telco CAT is grouped into two modules that illustrate the types of applications in
analytical CRM for telecommunications.
Module 1
Module 1 is a churn application specific to the telecommunications industry. It
consists of eight streams used to explore several data sets, prepare them for modeling,
and create and deploy churn prediction models. This module uses several steps to move
from the exploration of data to modeling and deployment:
Preprocessing streams produce a merged and augmented data file, cust_call_plus.dat,
which describes both customers and their behavior.
Several more streams perform a number of explorations and clustering exercises on
cust_call_plus.dat.
Additional preprocessing occurs and the data set is split into training and test data files
(train and test) used for predictive modeling.
Several clustering and association rule models are built in the modeling stream.
Finally, a deployment stream illustrates the deployment of a churn prediction model
that runs directly from the raw data. First, it performs all the required preprocessing
and then scores the customer base on propensity to churn.
For more information, see "Module 1--Churn Application" in the chapter Working with
CAT Streams.
Module 2
Module 2 is a cross-sell application specific to the telecommunications industry. It
uses seven streams to perform product analysis and cross-sell recommendations. This
module uses the sequence of explorations and operations listed below to produce a
recommendation model:
A preprocessing stream runs from the raw "till-roll" data (products.dat) that lists the
separate purchases made by each customer. This stream then produces a "basket" form
of the data with one record per customer (cust_prod.dat).
Further preprocessing merges the basket data with the customer/behavioral data from
Module 1 (churn application) to produce a final preprocessed file (cust_call_prod).
Various explorations and modeling exercises are performed on the preprocessed data.
Unlike those of the churn application, the cross-sell streams often make direct use of
the raw data.
As a final step, a deployment stream illustrates the techniques of product
recommendation using an association model.
For more information, see "Module 2--Cross-Sell Streams" in the chapter Working with
CAT Streams.
How to Use the CAT Streams
There are two ways to use the streams that ship with the CAT:
- Use streams as examples or illustrations of techniques to study as you build your own business-specific application. Simply load the streams into Clementine, execute them on the data provided with the CAT, and examine their composition using the stream information provided in this guide.
- Use the CAT streams as prepackaged components that you can attach to your existing data. With the minor modifications detailed in this guide, you can use the templates directly for your own data mining applications.
To determine which method is better for your business needs, address the
questions below.
How well does the CAT match your technical situation?
To determine a technical match, you should address issues such as:
- Data format
- Organization of the database
- Overlap between the attributes used in the CAT data and those available from your data
It is not necessary for the match between the CAT data and your data to be exact before the streams can be reused. For example, additional customer information, such as postal code, may be used without any need to change the stream. On the other hand, the inclusion of new usage categories that are used in various preprocessing steps would require minor changes to some streams (for example, the addition, deletion, or modification of Derive nodes). A completely different organization of data in the database might require significant restructuring of the streams or the addition of new preprocessing streams to bring the data organization into line with that of the CAT.
How well does the CAT answer your business questions?
To answer the second question, you should determine whether the specific questions
addressed by the CAT match your business questions, such as What is the relationship
between churn and dropped calls? You may have business reasons for believing that
this relationship is not relevant to your specific situation, in which case, you could omit
this exploration as you use the streams. On the other hand, you may want to address
other business questions that are not considered in the Telco CAT. In this case, you
would want to supplement the CAT with additional streams of your own. As a general
rule, the greater the divergence of your business questions from those addressed in the
CAT, the greater the likelihood that streams will need to be modified or used for
illustration rather than reuse.
Notes on Reusing CAT Streams
In general, the CAT streams can be reused to perform the same function with broadly
similar data. The Data Mapping tool reduces the need for hand editing when reusing
streams; however, in some cases, hand editing is necessary.
For more information, see "Mapping New Data to a CAT Stream" in the chapter
Using the Data Mapping Tool.
When reusing streams, you should also consider the following:
- In Server mode, the file paths for Source nodes may not be valid by default. You may need to specify a different path in the Source node dialog box.
- When publishing a stream, be sure to check that the output file path is not set to the $CLEO directory.
Chapter 4
Working with CAT Streams
Overview
Now that you have had an introduction to data mining in the telecommunications industry, you are ready to go into greater detail. In this chapter, you can examine in depth the streams of each module. You can see how the data are prepared and how the models are built. Read on for a closer look at how the Telco CAT works.
Module 1--Churn Application
The streams in Module 1 illustrate a churn prediction data mining application. As with
any data mining application, there are data preparation and exploration phases.
However, the core propensity modeling takes three primary approaches:
- Cluster the customers and look for high-churn clusters.
- Build rules or profiles that describe those customers likely to churn.
- Build scoring models indicating the degree to which a customer is similar to those who have churned.
The first approach is illustrated in M1_churnclust. The second and third approaches are shown in M2_churnpredict. Each of these approaches produces models that can be deployed in a churn prediction application. In this case, deployment is illustrated using a neural net scoring model (in the stream D1_churnscore).
The following diagram illustrates how the streams fit together to comprise the churn
application.
Figure 4-1
Data files and streams in Module 1
P1_aggregate. The first preprocessing stream takes two raw data files (custinfo and cdr)
and produces an intermediate file (cust_calls.dat). Three preprocessing steps are
performed:
- Aggregate the monthly call data into six-month totals.
- Produce averages and various combined fields from these totals.
- Merge the customer information with this aggregated call data.
P2_value. The second preprocessing stream merges the intermediate file cust_calls and
the tariff details file to produce a new intermediate file cust_call_plus.dat. This new file
deals with the higher-level issues of customers’ total spending and the appropriateness
of the tariffs. The stream also compares what each customer spends with what they
would have spent on the "next higher" tariff and flags those who would be "better off"
on a higher tariff.
E1_explore. This stream performs a number of visualizations that examine the churn indicator against a number of attributes considered likely to be relevant to churn behavior. The goal is to get a picture of the "shape" of the data before more detailed analyses are undertaken.
E2_ratios. This stream performs explorations that require preprocessing. These
explorations fall into five categories:
- Usage--How do usage bands, unused phones, and gender relate to churn?
- Ratios--How do the relations between different usage categories relate to churn?
- Handset--Do different types of handsets have different churn patterns?
- Dropped calls--How does the rate of dropped calls relate to churn?
- Tariff--Do tariff and tariff appropriateness have a relation to churn?
M1_churnclust. At this point in the module, business questions might focus on the
relationship between certain churn and spend groups. This stream attempts to answer
some of these questions as it produces clustering models and examines the relation of
the discovered clusters to churn and customer value (total spending). Then, it produces
rule-based profile models of the clusters. Characterizing the relevant clusters will allow
churn reduction campaigns to be targeted accurately. Profiling the high-churn groups
will also help you understand the reasons for churn. The derived fields from the
explorations in E2_ratios are included in this analysis via the SuperNode called
added_fields.
P3_split. This stream prepares the augmented customer data, cust_call_plus, for
predictive modeling. The fields from the explorations in the stream E2_ratios are
added by a SuperNode, and then the data is split randomly in half as training and test
data sets.
M2_churnpredict. This is the main predictive modeling stream for the churn
application. It builds a number of different predictive churn models using the training
data and then compares their performance on the test data.
D1_churnscore. This stream illustrates the deployment of scoring models using a neural
net scoring model as an example. It is important to note that, because this stream will
be deployed to run outside the context of Clementine, it must perform all of the
preprocessing from the raw data independent of any intermediate files.
P1_aggregate.str--Aggregate Call Data and Merge with Customer Record
This is the first data preparation step for churn analysis. This stream takes monthly call
data and aggregates it into six-month totals and then merges it with static customer
information.
Figure 4-2
Stream diagram for P1_aggregate.str
Stream Notes
Telco data is usually segmented into several different tables. For the purpose of data
mining, these tables need to be combined into a single one. In this example, there are
two types of data: CDR data (call data records) and customer information (length of
service, tariff, age, handset, etc.).
CDR data usually exists at several levels of aggregation:
- The lowest level is individual calls, which are usually too fine-grained for data mining purposes.
- The next level is monthly aggregate calls by type (peak, off-peak, weekend, and international). This type of data is often used for billing purposes.
The Telco CAT data set includes CDR data at the monthly level and contains call
minutes and the number of calls for each call type. This data is aggregated further to
give a six-month average that smooths out monthly fluctuations and is a more reliable
indicator of usage.
The SuperNode Avgs & Counts derives additional fields for average call times
(AvePeak, AveOffPeak, AveWeekend, etc.). From this analysis, you can see that clients
who make longer calls and possess certain other attributes form a significant segment
of the customer base.
Figure 4-3
Detailed view of SuperNode Avgs & Counts
Also derived in the SuperNode are All_calls_mins (sum of all minutes used for all call
types) and a total and average length for national calls (all call types except
international). These will be used to derive usage ratios in the exploration streams.
Figure 4-4
Deriving All_calls_mins
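Outside Clementine, the same aggregate-and-merge logic can be sketched in Python with pandas (a minimal illustration, not part of the CAT; field names follow Appendix A, while the tab delimiter and the subset of columns shown are assumptions):

    import pandas as pd

    # Load the raw files described in Appendix A.
    cdr = pd.read_csv("cdr.dat", sep="\t")
    cust = pd.read_csv("custinfo.dat", sep="\t")

    # Six-month totals per customer (the raw file has one row per customer per month).
    totals = (cdr.groupby("Customer_ID")
                 .agg(Peak_calls_Sum=("Peak_calls", "sum"),
                      Peak_mins_Sum=("Peak_mins", "sum"),
                      Weekend_calls_Sum=("Weekend_calls", "sum"),
                      Weekend_mins_Sum=("Weekend_mins", "sum"))
                 .reset_index())

    # Average call durations, guarding against division by zero for unused phones.
    totals["AvePeak"] = totals["Peak_mins_Sum"] / totals["Peak_calls_Sum"].replace(0, pd.NA)

    # Merge the static customer record onto the aggregated call behavior.
    cust_calls = cust.merge(totals, on="Customer_ID", how="inner")
    cust_calls.to_csv("cust_calls.dat", index=False)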
P2_value.str--Customer Value and Tariff Appropriateness
During the second phase of data preparation, this stream adds tariff information to
allow calculation of total customer revenue and value. It also calculates a tariff
appropriateness flag.
Figure 4-5
Stream diagram for P2_value.str
Stream Notes
In this stream, the previously merged customer/call information is merged again with
the tariff details table so that each customer is tagged with the details of the type of
tariff they are on.
The SuperNode Tariff approp then calculates several factors associated with
cost. Cost has two elements: the fixed cost of the tariff and the cost of the calls. All the
tariffs have some free minutes of call time, so customers pay only for calls over and
above their free minutes. International calls are not included in free minutes.
Call_cost_per_min is a calculation of the cost of all national calls before free minutes
divided by the total number of national minutes. The higher the proportion of off-peak
and weekend calls, the smaller this number will be.
The SuperNode also calculates whether the user is on the correct tariff. In other words,
you could ask Would their cost be less on the tariff above? The answer is calculated
by comparing the difference between the tariff fixed charges with the amount spent on
(nonfree national) calls. The calculation assumes that the general type of tariff (Play
or CAT) is always correct but that the customer might be on the wrong tariff within
that type (for example, on CAT 50 when he or she should be on CAT 100). The
SuperNode also creates "usage bands" so that usage can be categorized for certain
types of analysis.
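The "better off" test reduces to a cost comparison; here is a minimal sketch in Python (an illustration, not the CAT's exact calculation; field names follow Appendix A where possible, and normalization between the monthly fixed cost and the six-month totals is omitted for brevity):

    def better_off_on_next_tariff(row, next_tariff):
        """Flag a customer whose cost would be lower on the next tariff up (same tariff type)."""
        # Chargeable national minutes: anything over the free-minute allowance.
        mins_charge = max(row["National_mins"] - row["Free_mins"], 0)
        current_cost = row["fixed_cost"] + mins_charge * row["Call_cost_per_min"]

        next_charge = max(row["National_mins"] - next_tariff["Free_mins"], 0)
        next_cost = next_tariff["fixed_cost"] + next_charge * row["Call_cost_per_min"]
        return next_cost < current_cost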
Figure 4-6
Customer usage bands for National mins
By examining the graphs, you can learn that customers flagged as high for their tariff
are more likely to churn.
Figure 4-7
Distribution of tariff appropriateness overlaid with churn
E1_explore.str--Visualize Customer Information and Value
Data exploration is the first step in a churn analysis. This stream shows typical
exploratory analyses on both raw and augmented data.
Figure 4-8
Stream diagram for E1_explore.str
Stream Notes
The upper part of the stream shows analysis of attributes from the raw data. The lower
part looks at aggregated or derived fields.
Throughout the exploration phase, a range of indicator fields is examined. In each
graph, the fields are overlaid with the churn flag to determine if there is a simple
relationship between the variable and churn. Typically, what you look for here is an
increase or decrease in churn behavior that might indicate different customer segments.
For example, a histogram might reveal a trend or a specific band where churn is stronger; similarly, in distribution charts, some values might be associated with a higher churn rate than others.
Many of the churn segments in this module are characterized by a combination of
several variables. Typically, single variable graphs like those used in this stream do not
show any obvious trends. In this particular data set, however, you can see some
patterns emerge.
The following list describes the conclusions drawn from several of the graphs in
this stream:
Age. In general, younger people tend to churn at a higher rate, and this is reflected in
the data set.
L_O_S (length of service). Churn often occurs just after a contract expires. The key
period in this data set is 12-15 months.
Handset. The cost to the customer of a handset is often subsidized. In general, service
providers want customers to keep their handset for at least a year to recoup this cost.
People with older handsets tend to churn because it is often a cheap way to upgrade
their handsets. People with high-tech handsets may want to upgrade to the latest
version as soon as it is available. For high-tech handsets, this period is normally less
than six months. In this data set, the handsets have different product codes. The hightech handsets are ASAD with the larger version number being the newest.
Dropped calls. This graph indicates service quality (although it can also be related to handset problems). Clients with a high number of dropped calls tend to churn.
Figure 4-9
Relation of dropped calls to churn
Tariff. Some tariffs may be more vulnerable to churn than others because competitors
offer a more attractive package for this usage segment. Low-cost users in the cheapest
tariffs tend to churn more as they find that even their current tariff is too expensive.
Total_Cost (the total spending of the customer). Low-spending customers tend to churn more. Call cost without tariff (actual call cost) reveals this pattern even more strongly (this excludes international calls). Usage also reveals this trend, more clearly in the usage bands than in the histogram (All_calls_mins).
Usage fields. These fields are worth examining in separate graphs in order to check for
trends in usage segments related to churn. The presence of trends may depend on
particular tariff structures compared to competitors and type of call (for example,
international, peak, off-peak, weekend).
The fields examined in this stream are those one might examine in any churn
application. As data exploration shows, not all of the fields have identifiable patterns,
but some do. These relationships will help you determine the next step for your data
mining project.
E2_ratios.str--Visualize Derived Usage Category Information
This stream performs deeper data exploration by adding high-level attributes to the
data set that enable the identification of market segments with a relation to churn.
Figure 4-10
Stream diagram for E2_ratios.str
Stream Notes
One way of characterizing churn is to partition the customer base into usage segments
and then analyze these for propensity to churn. Deriving higher-level attributes helps
this process and enables simpler rules that describe the segments and predict churn.
The upper right part of this stream derives a set of usage ratios that can be used to describe customers in terms such as high off-peak calls when a large proportion of their calls are off-peak.
Four ratios are derived:

Peak ratio            Peak minutes to national minutes
OffPeak ratio         Off-peak minutes to national minutes
Weekend ratio         Weekend minutes to national minutes
Nat-Internat ratio    International minutes to national minutes
The ratios show the proportion of all calls by those in a particular category (rather
than ratios between categories). This method avoids the distortion caused by very
low numbers when "pure" ratios are used. Examining the graphs for these ratios,
you can see:
- In general, the relation of these ratios to churn will depend on the tariff structure and the competitive environment. Some tariffs favor peak or off-peak calls.
- In this data set, the OffPeak ratio is related to tariff and to high-churn customer segments.
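Continuing the pandas sketches above (column names are illustrative, based on the six-month totals from cust_calls.dat), the four derivations look like this:

    # Dividing by national minutes (rather than by another category) avoids the
    # distortion that "pure" ratios suffer at very low call volumes.
    nat = cust["National_mins"].replace(0, pd.NA)  # avoid division by zero
    cust["Peak_ratio"] = cust["Peak_mins_Sum"] / nat
    cust["OffPeak_ratio"] = cust["OffPeak_mins_Sum"] / nat
    cust["Weekend_ratio"] = cust["Weekend_mins_Sum"] / nat
    cust["Nat_Internat_ratio"] = cust["International_mins_Sum"] / nat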
Figure 4-11
Graph of weekend ratio overlaid with churn
The upper left part of the stream, as viewed in the stream canvas, explores the
phenomenon of no usage. A no usage flag identifies those customers who have not
used their phones in the period covered by the data. This segment has a higher
propensity to churn, with the exception of people who use their phones for emergency
purposes only. The Select node called No Usage explores this segment in more detail.
The lower left part of the stream examines the relation of tariff and the tariff
appropriateness indicator to churn.
The middle right part of the stream examines the churn properties of different
handsets. The distribution of handsets shows that some handsets are particularly
associated with churn. The churn score and aggregate branch of the stream calculates
a churn score for each handset (the average churn fraction) and ranks handsets in order
of score.
Figure 4-12
Distribution of churn with handset type
The lower right part of the stream examines dropped calls. You can explore the
Dropped_calls histogram and generate a Derive node to flag records with a high
number of dropped calls. The resulting distribution graph, high Dropped calls, shows
the increased propensity to churn in this group.
The derived fields in this stream will be included in the next preprocessing stage.
29
Working with CAT Streams
P3_split.str--Derive Usage Category Information and Train/test Split
This stream prepares the data for modeling. It adds the additional fields derived when
exploring the data and randomly splits the data into a test and training set.
Figure 4-13
Stream diagram for P3_split.str
Stream Notes
The SuperNode added fields adds the higher-level attributes derived in e2_ratios.
Figure 4-14
SuperNode stream segment
The Derive node called Split generates a random number (either 1 or 2). This field is used to partition the data set into training and test subsets that are written into separate files. These files are used for predictive modeling in M2_churnpredict.str.
Figure 4-15
Derive node generating a random number
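Outside Clementine, the same 50/50 partition can be sketched as follows (illustrative only; the fixed seed is an assumption added for repeatability):

    import numpy as np

    rng = np.random.default_rng(seed=42)                # fixed seed for a repeatable split
    cust["Split"] = rng.integers(1, 3, size=len(cust))  # 1 or 2, roughly half each
    cust[cust["Split"] == 1].to_csv("train", index=False)
    cust[cust["Split"] == 2].to_csv("test", index=False)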
The stream also contains a Type node that can be saved and reused in the predictive
modeling stream. This step is necessary to ensure that the Type node in use for
modeling has been instantiated with all the data (and not just the training subset).
M1_churnclust.str--Customer Clustering and Value/churn Analysis
This stream clusters the data, producing customer segments, using two different clustering techniques. The resulting clusters are analyzed for value and propensity to churn.
Figure 4-16
Stream diagram for M1_churnclust.str
Stream Notes
Clustering is an alternative to predictive analysis and can give you insight into the
"natural" segments in the data. For example, if high-churn clusters can be identified,
business actions such as special offers might be tailored to that segment. Clustering can
also be used for value analyses such as identifying high-spending clusters and cross-sell opportunities.
The SuperNode added fields adds the attribute fields that were derived in
e2_ratios. Two clustering techniques are then used: Kohonen and K-means. The
upper part of the stream analyzes the Kohonen clustering, and the lower part
analyzes the K-means clustering.
The Kohonen network produces a two-dimensional grid of clusters. The Derive
node Cluster No labels these fields, making it possible to analyze individual clusters.
The top branch of the stream adds a churn score (either 0 or 1) and then aggregates to
show the average churn per cluster. These clusters are then ranked and displayed in the
table called churn.
Figure 4-17
Table illustrating clusters ranked by churn score
A related branch of the Kohonen network (ending in the table called value) calculates
the average value per cluster and ranks the clusters in order to identify high-spending
clusters. The relation of the clusters to churn and value is also examined by
visualization. Another related branch uses C5.0 rule induction to create a ruleset that
profiles the clusters. This model is called Cluster No.
Comparison of the value, churn, and profile associated with particular clusters can
give you detailed insight into the customer base. For example, cluster 11 is a high-churn, medium-value cluster associated with high-usage males with certain tariffs
and handsets.
Figure 4-18
Clusters segmented by customer value
The lower part of the stream performs a similar analysis for a K-means cluster model. The relationship of the clusters to value and churn is explored, the value and churn cluster rankings are calculated, and the clusters are profiled using C5.0.
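The ranking logic itself is simple; a minimal pandas sketch, assuming a DataFrame with an assigned cluster label and a 0/1 churn score (all names illustrative):

    ranking = (cust.groupby("Cluster_No")
                   .agg(avg_churn=("Churn_score", "mean"),  # churn fraction per cluster
                        avg_value=("Total_Cost", "mean"),   # average customer value
                        size=("Churn_score", "size"))
                   .sort_values("avg_churn", ascending=False))
    print(ranking)  # high-churn clusters at the top, as in the table called churn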
M2_churnpredict.str--Model Propensity to Churn
This is the main stream for predictive churn modeling. It builds several models based
on the training data and evaluates them on the test data set.
Figure 4-19
Stream diagram for M2_churnpredict.str
Stream Notes
This stream builds four different models that predict churn:
n C5.0 ruleset
n C&RT decision tree
n Logistic regression model
n Neural network
The C5.0, C&RT, and logistic regression models are categorical in that they make a yes/no prediction for churn. The neural network is a scoring model because the churn flag is replaced by a number (0.0 or 1.0, called churn score) used as the prediction target. The neural network thus predicts churn on a continuum between 0 and 1. The C5.0 and C&RT predictions are also converted into scores between 0 and 1 using the confidence values (visualized in the histograms Score C and Score R). Logistic regression models produce probability fields that can be used directly for scoring.
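The confidence-to-score conversion can be sketched as follows (an illustration; the prediction labels are assumptions, not Clementine's internal field naming):

    def churn_score(prediction, confidence):
        # A confident "churned" prediction maps near 1;
        # a confident "active" prediction maps near 0.
        return confidence if prediction == "churned" else 1.0 - confidence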
When using multiple models, you might ask How do I know which model to
select for scoring?
To answer this question, you can compare the performance of the three categorical
models using the Analysis node in the lower stream and the nearby evaluation chart
that compares the gains curve of all three models. The performance of the neural
network is analyzed separately in the lower part of the stream, but the performance of
all the models can be compared in the lower evaluation chart.
Figure 4-20
Evaluation chart comparing all four models
A second issue to consider is the likely value of the clients that may churn. The plot of
Total_Cost v. $N-Churnscore is useful for devising a campaign matrix. The first clients
to contact would be those with high value and high score (most likely to churn)
followed by high value and medium score, and medium value and high score.
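A campaign matrix of this kind can be expressed directly as a priority lookup; a minimal sketch (band names are illustrative):

    def contact_priority(value_band, score_band):
        """Lower numbers are contacted first; 99 means outside the campaign."""
        order = {("high", "high"): 1, ("high", "medium"): 2, ("medium", "high"): 3}
        return order.get((value_band, score_band), 99)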
At this point, you have completed all the data preparation and modeling for Module 1
and should have enough information to facilitate decision making. However, if you
would like to deploy these streams independently of the Clementine application, read
on for more information.
D1_churnscore.str--Score Propensity to Churn
This is an example model deployment stream. Starting from raw data, it performs all
the necessary preprocessing to score customers using the neural network model built
in stream M2_churnpredict. This stream can be "published" using the Clementine
Solution Publisher, producing a deployable scoring application.
Figure 4-21
Stream diagram for D1_churnscore.str
Stream Notes
The deployment stream runs independently of Clementine and therefore has to
combine all the operations needed to create the data in a single stream. Three input files
are required: cdr.dat (call data), custinfo.dat (customer information), and tariff.dat.
The lower left part of the stream duplicates the processing of p1_aggregate.str, the
upper left part duplicates p2_value.str, and the second SuperNode duplicates the fields
derived in p3_split.str. The data is then run through the neural net scoring model from
m2_churnpredict.str.
The final Publisher node is used to generate the stand-alone application.
Module 2--Cross-Sell Streams
Module 2 contains streams that illustrate the general structure of a cross-sell data
mining application. As with the Module 1 churn application, there are preprocessing
and exploration phases. The main product opportunity analysis has a slightly different
structure from that of the churn application because there is no one event or product on
which to focus the application. Instead, this module focuses on:
- Associations between products (illustrated in E3_products)
- Groupings of products (illustrated in E4_prodvalue and M4_prodprofile)
- Groupings of customers (illustrated in M3_prodassoc)
In all of these manipulations, you should focus on discoveries that will allow you to
predict purchasing patterns at a higher level than individual products. Again, any of
these approaches can be deployed to make purchase recommendations (that is, to
indicate likely purchases for individual customers). In this module, the deployment of
association rules for recommendations is illustrated in D2_recommend.str.
Figure 4-22
Data files and streams in Module 2
P4_basket. This stream performs a simple set-to-flag or basket transformation on the
raw till-roll style customer/product information. This process produces a basket data
format with one record per customer and one flag field per product. The stream uses
products.dat and produces cust_prod.dat.
E3_products. This stream explores the relationships between product purchases using a
web display and association rule modeling (Apriori).
D2_recommend. This stream illustrates how association rules can be deployed to
produce recommendations, or likely purchases for customers based on what they have
already purchased.
P5_custbasket. This stream combines the basket-style data produced by P4_basket.str
(cust_prod.dat) with the augmented customer/call information (cust_call_plus.dat) to
produce a new file (cust_call_prod.dat).
M3_prodassoc. This stream builds a Kohonen clustering model based on customer and
call information and then explores the relationships between the clusters and product
purchases. The goal is to discover groups of customers with propensities for certain
purchases. This stream is potentially useful for cross-selling recommendations.
E4_prodvalue. This stream explores the relationship between product purchase and total
customer spending (or value). Value-related product groups are discovered.
M4_prodprofile. This stream profiles the value-related product groups discovered in
E4_prodvalue in terms of customer and call information. The goal is to discover
profiles of customers likely to buy the products in each group. The stream illustrates
how customers predicted to buy the products in a group can be selected as targets for a
cross-selling campaign.
P4_basket.str--Produce Customer Product Basket Records
This is the initial stream of the Module 2 cross-sell application. It takes the raw product
information (one record for every product sold) and produces a single basket record for
each customer.
Figure 4-23
Stream diagram for P4_basket.str
Stream Notes
The raw product data is in the form of a till-roll, where each record links a customer
ID to one product purchased. The Set-to-Flag node takes the set of all products and
creates a flag field for each product and then aggregates by customer ID. The result is
a "basket" record for each customer, containing T in the fields for products they have
purchased and an F in the other (nonpurchased) product fields.
Executing this stream will produce a new Source node called cust_prod.dat. This
Source node is then merged with data from another source as shown in the next stream,
p5_custbasket.str.
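The set-to-flag transformation has a compact equivalent in pandas (a minimal sketch, not the CAT itself; the tab delimiter is an assumption):

    import pandas as pd

    till = pd.read_csv("products.dat", sep="\t")  # Customer_ID, Product

    # One row per customer, one flag column per product.
    basket = pd.crosstab(till["Customer_ID"], till["Product"]).astype(bool)
    basket = basket.replace({True: "T", False: "F"})  # Clementine-style T/F flags
    basket.to_csv("cust_prod.dat")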
P5_custbasket.str--Merge Customer, Usage, and Basket Data
In this second phase of data preparation, you will merge the augmented customer call
data from Module 1 with the product data used in this module. This synthesis enables
more detailed analysis of product purchases and customer profiles.
Figure 4-24
Stream diagram for P5_custbasket.str
Stream Notes
The Merge node combines the basket data with the cust_call_plus data. This latter data
set includes both call data records and customer information, such as demographics.
The Derive node splits customer value into five bands to assist the exploration of
relationships between products and customer value or spending.
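Banding value into five groups can be sketched in one call (illustrative, assuming merged is the merged customer/basket DataFrame; pd.qcut produces equal-frequency bands, whereas the Derive node may use fixed thresholds):

    merged["ValueBand"] = pd.qcut(
        merged["Total_Cost"], q=5,
        labels=["very low", "low", "medium", "high", "very high"])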
Figure 4-25
Splitting customer value into bands using a Derive node
E3_products.str--Product Association Discovery
Following the CRISP-DM model, this stream moves from data preparation to the data
understanding and exploration phases. In E3_products.str, you can explore the
relationships within product purchases.
Figure 4-26
Stream diagram for E3_products.str
Stream Notes
The stream explores the purchasing relationships (which products are purchased
together) in the basket data. The Web node called Products looks at pairwise
associations between products.
Figure 4-27
Web analysis of product associations
The Apriori node called Products performs a basket analysis and can confirm these
binary patterns while discovering more complex (multiproduct) purchasing patterns.
You can tune the Apriori node, changing the thresholds to control the number of
relationships found. The Web node can be used to estimate appropriate thresholds for
confidence and coverage.
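A comparable analysis outside Clementine might use the mlxtend library (an assumption; any Apriori implementation would do), applied to the boolean basket table built earlier:

    from mlxtend.frequent_patterns import apriori, association_rules

    # basket: one boolean column per product (before any T/F string conversion).
    frequent = apriori(basket, min_support=0.05, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    # Raise min_support / min_threshold to reduce the number of rules found.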
M3_prodassoc.str--Customer Clustering and Product Analysis
During data mining, it is typical to build several models while exploring the data. This
stream builds a Kohonen clustering model based on call usage and customer
information and then analyzes the clusters in terms of product purchases. The purpose
is to identify cross-selling opportunities.
Figure 4-28
Stream diagram for M3_prodassoc.str
Stream Notes
In this stream, the Kohonen network is used to cluster customers based on call usage
data (behavioral data) and customer information. The rest of the stream analyzes the
characteristics of these clusters.
- The lower left part of the stream calculates the total and average customer value for each cluster.
- The right side of the stream merges the clustered customer data with the individual product list. This step helps you to characterize each cluster in terms of products purchased. To identify cross-selling opportunities, you can select clusters that have a high proportion of purchases of the product of interest and then try to sell the product to remaining clients in that cluster who have not purchased the product. The Matrix node called cluster x Product breaks down the sales of each product by cluster, thus highlighting the high-selling clusters for each product.
- The upper right branch sorts product purchases by cluster. Products are sorted in terms of the number purchased in each cluster, showing the top-selling products for each cluster.
The distribution graphs at the bottom of the stream canvas help to clarify the analysis:
- Cluster is a distribution of clusters overlaid with products, indicating the relative importance of the different products in each cluster. For example, products 11 and 12 are relatively unimportant in cluster 02 but relatively important in cluster 32.
- ValueBand gives a distribution of value band overlaid by product. For example, products 11 and 12 are relatively important in certain areas, such as low-value bands.
- Finally, Product shows the converse relationship, the relative importance of the different value bands to each product. In the distribution graph, the product groupings are clearly visible. For example, products 1-4 are associated with high-value customers.
Figure 4-29
ValueBand distribution showing products purchased in each value band
E4_prodvalue.str--Product Groupings Based on Customer Value
This stream explores the relationships between customer value bands and products
purchased. It also derives value-based product groupings.
Figure 4-30
Stream diagram for E4_prodvalue.str
Stream Notes
The upper part of the stream analyzes the distribution of products purchased in
different value bands. The Directed Web node called ValueBand x Products shows
these relationships. Four groups of product associations, revealed by this web and the
previous stream (e3_products), have been coded into the Derive nodes as flags for
groups 1-4. The relations between these groups and the value bands are then explored
in the directed web called ValueBand x Groups.
Figure 4-31
Web map of product groups and value bands
The lower part of the stream uses the raw product records and merges them with the
total cost and value band information in order to count products per customer.
The upper branch of the stream calculates the average number of products per
customer in each of the five value bands and ranks the bands in terms of this average.
The lower branch calculates the total for each product purchased in each value band.
The aggregated data is then sorted to give a ranked list of value bands for each product.
This ranking helps answer questions such as For a given product, in which value bands
does it sell best?
M4_prodprofile.str--Propensity to Buy Grouped Products
This stream builds C5.0 rulesets showing the profiles for purchasers of the product
groups identified in e4_prodvalue.str.
Figure 4-32
Stream diagram for M4_prodprofile.str
Stream Notes
Previous streams (e3_products.str and e4_prodvalue.str) have been used to discover
product groupings or sets of products that customers purchase together. In this
stream, these groups are flagged by derived indicators in the SuperNode called
Groups. Such flagging is helpful for cross-selling, for example, when there is a three-product grouping and you have a number of customers with only two of the three
products. You could use this information to identify these customers and to offer
them the third product.
This stream builds profiles for three of the four identified product groupings using
customer behavioral and descriptive information. Some groups (in this case, group 1)
produce no useful profile. The three models built from this data have quite different
characters: the model for group 2 is very simple, the model for group 4 is moderately
complex, and the model for group 3 is very complex. This complexity appears to have an inverse relationship to the quality of the model: the simpler the model, the more accurate it is (as illustrated by the Analysis node results) and the better its gains chart (as shown by the evaluation chart).
Figure 4-33
Evaluation of different propensity models
The models, once built, can be used to predict which clients will buy a particular
product grouping. Similarly, those who have not purchased all products in their group
can be targeted for cross-selling. The lower right branch of the stream shows the
selection of targets for group 2 products.
D2_recommend.str--Product Recommendations from Association Rules
Following the CRISP-DM model, the final phase of most data mining projects is
deployment. This stream implements a recommendation engine for association rules
that can be deployed outside of Clementine. It compiles a basket of items for a client
and uses the association ruleset to recommend additional items to purchase, sorted in order of rule confidence.
Figure 4-34
Stream diagram for D2_recommend.str
Stream Notes
The upper part of the stream produces an association rule model and converts it into a form where it can be used for product recommendations. The lower stream uses the converted rules to recommend additional products for a user’s "basket"; the basket of products already purchased can be provided using a User Input node. The format of the input basket is a single record for each product purchased, containing a user ID and the product. This input format allows you to make recommendations for multiple users simply by substituting a file containing user IDs and product purchases (the file products.dat provides such an example).
There are actually three streams in this file. The top left stream generates the
association rules using Apriori. The unrefined association rule model will appear in the
Generated Models palette of the Managers window. To complete the first stage of rule
preparation, you should browse the model and select Show criteria, then export the
model as a text file (in this case assoc_rules.txt).
The second stream converts the ruleset into a form that can be used for
recommendations. The association rules are saved in the form:
instances support confidence consequent antecedent1 ...
Components are separated by tabs. In this case, the rules are interpreted to mean that if
the basket contains the antecedents then purchasing the consequent is recommended
(provided they haven’t already purchased it). The antecedents may contain one or more items. The stream assigns a rule number (Rule) to each rule and converts the rules
into a record for each condition. It also adds a variable (Conds) which is the total
number of conditions in each rule. The stream can process rulesets that have up to three
conditions. (Further branches would have to be added for rules with more than three
conditions.)
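The rule-to-conditions conversion can be sketched in Python (assuming the tab-separated export format shown above; unlike the stream, a loop handles any number of conditions rather than three fixed branches):

    conditions = []
    with open("assoc_rules.txt") as f:
        for rule_no, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("\t")
            instances, support, confidence, consequent = parts[:4]
            antecedents = parts[4:]
            for cond in antecedents:
                conditions.append({
                    "Rule": rule_no,
                    "Conds": len(antecedents),  # total conditions in this rule
                    "Condition": cond,
                    "Consequent": consequent,
                    "Confidence": float(confidence),
                })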
The stream determines whether the conditions consist of one, two or three items by
examining the antecedent fields. All rules have at least one condition (Cond1), so every
rule/record passes through the upper branch; the second and third branches are used
only if there are second and third conditions (Cond2 and Cond3). The different
branches select these cases and produce a separate record for every condition; the rule
number (Rule) and the total number of conditions (Conds) are attached to each
condition. These condition records are appended together into the file conditions.txt,
which is used in the final stream for recommendations.
The lower stream is the recommendation engine. A user ID and basket are entered via the User Input node. The basket items are entered in single quotes (’01’ ’02’ etc.), separated by spaces. This produces a record for each product purchased. The Derive node Condition converts each product into the same form as it appears in the conditions file from the association rules. The user basket is then merged with the conditions. All conditions that appear in the user basket will be matched. The resultant "matched conditions" data is then aggregated by user and rule number (Rule), and only those rules where all the conditions are matched are retained (Select node matched).
The Derive node AllProducts accumulates the customer basket; selecting the last record for a customer yields the total product basket. (Note: this will work for multiple customers.) The total basket (AllProducts) is merged with the matched rules, and those rules which recommend products that are already in the basket are discarded. The remaining rules are sorted on customer ID and rule confidence; the Distinct node then discards any products that have been recommended for the same customer more than once. The recommendations, already sorted in order of confidence, are displayed in the table. The Clementine Solution Publisher node can be used to deploy this stream as a standalone recommendation engine.
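The matching logic reduces to: a rule fires when every one of its conditions is in the basket, and its consequent is recommended unless already owned. A minimal sketch using the condition records built above (illustrative, single-customer case):

    def recommend(basket_items, conditions):
        """Return recommended products, sorted by descending rule confidence."""
        by_rule = {}
        for c in conditions:
            by_rule.setdefault(c["Rule"], []).append(c)
        recs = {}
        for conds in by_rule.values():
            # A rule fires only if every one of its conditions is in the basket.
            if all(c["Condition"] in basket_items for c in conds):
                consequent = conds[0]["Consequent"]
                if consequent not in basket_items:  # skip products already owned
                    conf = conds[0]["Confidence"]
                    recs[consequent] = max(recs.get(consequent, 0.0), conf)
        return sorted(recs, key=recs.get, reverse=True)

    print(recommend({"01", "02"}, conditions))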
Appendix A
Telco CAT Data Files and Field Names
Raw Data Files for Modules 1 and 2
Data file: custinfo.dat

Field name       Explanation
Customer_ID      Unique customer key
Gender           Sex--male or female
Age              Age in years
Connect_Date     Date phone was "connected"--start of customer relationship
L_O_S            Length of service in months (since connect date)
Dropped_Calls    Number of dropped calls during 6-month period
Pay Method       Method of payment--either pre- or post-paid
tariff           Tariff type
Churn            Flag--churned or active
Handset          Name of handset type
Data file: cdr.dat

Field name          Explanation
Customer_ID         Unique customer key
Peak_calls          Number of peak-time calls in month indicated
Peak_mins           Number of peak-time call minutes in month indicated
OffPeak_calls       Number of off-peak calls in month indicated
OffPeak_mins_Sum    Number of off-peak minutes in month indicated
Weekend_calls       Number of weekend calls in month indicated
Weekend_mins        Number of weekend minutes in month indicated
International_mins  Number of international-call minutes in month indicated
Nat_call_cost_Sum   Cost of national calls (peak + off-peak + weekend) in month indicated
month               The month described by the record--6 months supplied for each customer
Data file: tariff.dat

Field name          Explanation
tariff              Tariff type
fixed_cost          Fixed monthly cost for this tariff type
Free_mins           Number of free (national) call minutes for this tariff type
peak_rate           Cost per minute for peak-time calls beyond free minutes for this tariff type
OffPeak_rate        Cost per minute for off-peak calls beyond free minutes for this tariff type
Weekend_rate        Cost per minute for weekend calls beyond free minutes for this tariff type
International_rate  Cost per minute for international calls for this tariff type
Voicemail           Cost of voicemail service (not used)
SMS                 Cost of SMS service (not used)
Data file: products.dat

Field name   Explanation
Customer_ID  Unique customer key
Product      One product bought by this customer (a customer may have several rows)
Intermediate Data Files for Module 1
Data file: cust_calls.dat--new fields added by p1_aggregate.str

Field name              Explanation
Customer_ID             Inherited from custinfo.dat
Gender
Age
Connect_Date
L_O_S
Dropped_Calls
Pay Method
tariff
Churn
Handset
Peak_calls_Sum          Total number of peak-time calls in 6-month period
Peak_mins_Sum           Total number of peak-time call minutes in 6-month period
OffPeak_calls_Sum       Total number of off-peak calls in 6-month period
OffPeak_mins_Sum        Total number of off-peak minutes in 6-month period
Weekend_calls_Sum       Total number of weekend calls in 6-month period
Weekend_mins_Sum        Total number of weekend minutes in 6-month period
International_mins_Sum  Total number of international-call minutes in 6-month period
Nat_call_cost_Sum       Total cost of national calls (peak + off-peak + weekend)
AvePeak                 Average duration of peak-time calls during 6-month period
AveOffPeak              Average duration of off-peak calls during 6-month period
AveWeekend              Average duration of weekend calls during 6-month period
National_calls          Total number of national calls in 6-month period
National mins           Total number of national minutes in 6-month period
AveNational             Average duration of national calls during 6-month period
All_calls_mins          Total number of call minutes in 6-month period (national + international)
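The derived fields in this table follow directly from the aggregates. A minimal Python sketch, assuming a record is a dictionary keyed by the field names above and that national totals are the sum of peak, off-peak, and weekend figures (as indicated for Nat_call_cost_Sum):

# Sketch of the p1_aggregate.str derivations for one customer record.
def derive_averages(rec):
    def ave(mins, calls):
        # Guard against customers with zero calls in a category.
        return rec[mins] / rec[calls] if rec[calls] else 0.0
    rec["AvePeak"] = ave("Peak_mins_Sum", "Peak_calls_Sum")
    rec["AveOffPeak"] = ave("OffPeak_mins_Sum", "OffPeak_calls_Sum")
    rec["AveWeekend"] = ave("Weekend_mins_Sum", "Weekend_calls_Sum")
    rec["National_calls"] = (rec["Peak_calls_Sum"]
                             + rec["OffPeak_calls_Sum"]
                             + rec["Weekend_calls_Sum"])
    rec["National mins"] = (rec["Peak_mins_Sum"]
                            + rec["OffPeak_mins_Sum"]
                            + rec["Weekend_mins_Sum"])
    rec["AveNational"] = ave("National mins", "National_calls")
    rec["All_calls_mins"] = (rec["National mins"]
                             + rec["International_mins_Sum"])
    return rec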
Data file: cust_call_plus.dat--new fields added by p2_value.str

Field name              Explanation
Customer_ID             Inherited from custinfo.dat
Gender
Age
Connect_Date
L_O_S
Dropped_Calls
Pay Method
tariff
Churn
Handset
Peak_calls_Sum          Inherited from cust_calls.dat
Peak_mins_Sum
OffPeak_calls_Sum
OffPeak_mins_Sum
Weekend_calls_Sum
Weekend_mins_Sum
International_mins_Sum
Nat_call_cost_Sum
AvePeak
AveOffPeak
AveWeekend
National_calls
National mins
AveNational
All_calls_mins
Usage_Band              A banding of national call minutes
Mins_charge             Number of chargeable national call minutes in 6-month period (national minutes - free minutes)
call_cost_per_min       Cost of national calls per minute ignoring free minutes
actual call cost        Cost of national calls after free minutes removed--indicates call mix
Total_call_cost         actual call cost + cost of international calls
Total_Cost              Total call cost + fixed cost of tariff
Tariff_OK               Flag to indicate tariff appropriateness
average cost min        Total cost / all call minutes (average call cost per minute including tariff cost and international calls)
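The value fields amount to simple arithmetic on the merged customer and tariff records. The Python sketch below is one plausible reading: the 6-month scaling of free minutes and fixed cost, the international-cost calculation, and the formula for actual call cost are assumptions, not the stream's exact formulas.

# A plausible reading of the p2_value.str calculations, not the
# stream's exact formulas. Assumes `rec` holds the cust_calls.dat
# fields and `tariff` is the matching row from tariff.dat.
def derive_value(rec, tariff):
    nat_mins = rec["National mins"]
    free = tariff["Free_mins"] * 6            # assumed 6-month scaling
    rec["Mins_charge"] = max(nat_mins - free, 0)
    rec["call_cost_per_min"] = (rec["Nat_call_cost_Sum"] / nat_mins
                                if nat_mins else 0.0)
    # "Cost of national calls after free minutes removed" -- assumed to
    # be chargeable minutes priced at the observed per-minute cost.
    rec["actual call cost"] = rec["Mins_charge"] * rec["call_cost_per_min"]
    intl = rec["International_mins_Sum"] * tariff["International_rate"]
    rec["Total_call_cost"] = rec["actual call cost"] + intl
    rec["Total_Cost"] = rec["Total_call_cost"] + tariff["fixed_cost"] * 6
    rec["average cost min"] = (rec["Total_Cost"] / rec["All_calls_mins"]
                               if rec["All_calls_mins"] else 0.0)
    return rec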
Data files: train.dat and test.dat--new fields added by p3_split.str--also derived and explored in e2_ratios.str and m1_churnclust.str

Field name              Explanation
Customer_ID             Inherited from custinfo.dat
Gender
Age
Connect_Date
L_O_S
Dropped_Calls
Pay Method
tariff
Churn
Handset
Peak_calls_Sum          Inherited from cust_calls.dat
Peak_mins_Sum
OffPeak_calls_Sum
OffPeak_mins_Sum
Weekend_calls_Sum
Weekend_mins_Sum
International_mins_Sum
Nat_call_cost_Sum
AvePeak
AveOffPeak
AveWeekend
National_calls
National mins
AveNational
All_calls_mins
Usage_Band              Inherited from cust_call_plus.dat
Mins_charge
call_cost_per_min
actual call cost
Total_call_cost
Total_Cost
Tariff_OK
average cost min
Peak ratio              Ratio of peak-time minutes / national minutes
Offpeak ratio           Ratio of off-peak minutes / national minutes
Weekend ratio           Ratio of weekend call minutes / national minutes
Nat-InterNat Ratio      Ratio of international call minutes / national minutes
High Dropped calls      Number of dropped calls above threshold
No usage                Client has made 0 calls in the 6-month period
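The ratio and flag fields are straightforward to reproduce. A minimal Python sketch, in which the dropped-calls threshold is a hypothetical value (p3_split.str defines the actual threshold, which is not documented here):

# Sketch of the p3_split.str ratio and flag derivations.
DROPPED_THRESHOLD = 10   # hypothetical; the CAT's value may differ

def derive_ratios(rec):
    nat = rec["National mins"]
    rec["Peak ratio"] = rec["Peak_mins_Sum"] / nat if nat else 0.0
    rec["Offpeak ratio"] = rec["OffPeak_mins_Sum"] / nat if nat else 0.0
    rec["Weekend ratio"] = rec["Weekend_mins_Sum"] / nat if nat else 0.0
    rec["Nat-InterNat Ratio"] = (rec["International_mins_Sum"] / nat
                                 if nat else 0.0)
    # Read here as a flag raised when dropped calls exceed the
    # threshold; the stream may instead store the excess count.
    rec["High Dropped calls"] = rec["Dropped_Calls"] > DROPPED_THRESHOLD
    rec["No usage"] = rec["All_calls_mins"] == 0
    return rec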
Intermediate Data Files for Module 2
Data file: cust_prod.dat

Field name   Explanation
Customer_ID  Unique customer key
Product_01   For each product, a flag indicating whether the customer bought this product (one record per customer)
Product_02
Product_03
Product_04
Product_05
Product_06
Product_07
Product_08
Product_09
Product_10
Product_11
Product_12
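cust_prod.dat is a set-to-flag pivot of products.dat: many rows per customer become one row with a flag per product. A minimal Python sketch of that conversion, assuming tab-separated files with header rows, product codes '01' through '12', and T/F flag values (all assumptions about the file layout):

# Sketch: pivot products.dat (one row per purchase) into
# cust_prod.dat (one row per customer, one flag per product).
import csv
from collections import defaultdict

PRODUCTS = [f"Product_{i:02d}" for i in range(1, 13)]

baskets = defaultdict(set)
with open("products.dat") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # Product values assumed coded '01'..'12'.
        baskets[row["Customer_ID"]].add(f"Product_{row['Product']}")

with open("cust_prod.dat", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Customer_ID"] + PRODUCTS)
    for cust, owned in sorted(baskets.items()):
        writer.writerow([cust] + ["T" if p in owned else "F"
                                  for p in PRODUCTS])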
Data file: cust_call_prod.dat--created by p5_custbasket.str

Field name              Explanation
Customer_ID             Inherited from custinfo.dat
Gender
Age
Connect_Date
L_O_S
Dropped_Calls
Pay Method
tariff
Churn
Handset
Peak_calls_Sum          Inherited from cust_calls.dat
Peak_mins_Sum
OffPeak_calls_Sum
OffPeak_mins_Sum
Weekend_calls_Sum
Weekend_mins_Sum
International_mins_Sum
Nat_call_cost_Sum
AvePeak
AveOffPeak
AveWeekend
National_calls
National mins
AveNational
All_calls_mins
Usage_Band              Inherited from cust_call_plus.dat
Mins_charge
call_cost_per_min
actual call cost
Total_call_cost
Total_Cost
Tariff_OK
average cost min
Product_01              Inherited from cust_prod.dat
Product_02
Product_03
Product_04
Product_05
Product_06
Product_07
Product_08
Product_09
Product_10
Product_11
Product_12
Appendix B
Using the Data Mapping Tool
Mapping New Data to a CAT Stream
Using the mapping tool, you can connect new data to a pre-existing stream. A
common use is to replace the source node defined in a Clementine Application
Template (CAT) with a source node that defines your own data set. The Mapping tool
will not only set up the connection but will also help you to specify how field names
in the new source will replace those in the existing template. In essence, mapping data
results simply in the creation of a new Filter node, which matches up the appropriate
fields by renaming them.
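Conceptually, that generated Filter node is nothing more than a field-renaming table applied to every record of the new source. A minimal sketch with a hypothetical mapping (cust_no and sex are invented field names for illustration):

# The Map Filter node, reduced to its essence: a dictionary from
# new-source field names to the template's field names.
FIELD_MAP = {"cust_no": "Customer_ID", "sex": "Gender"}  # new -> template

def map_fields(record):
    # Unmapped fields pass through unaltered, as in the dialog box.
    return {FIELD_MAP.get(name, name): value
            for name, value in record.items()}

# map_fields({"cust_no": "A1", "sex": "M", "Age": 30})
# -> {"Customer_ID": "A1", "Gender": "M", "Age": 30}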
There are two equivalent ways to map data:
Select Replacement Node. This method starts with the node to be replaced. First, you
select the node to replace; then, using the Replacement option from the context menu,
select the node with which to replace it. This way is particularly suitable for mapping
data to a CAT.
Map to. This method starts with the node to be introduced to the stream. First, select
the node to introduce; then, using the Map option from the context menu, select the
node to which it should connect. This way is particularly useful for mapping to a
terminal node. Note: You cannot map to Merge or Append nodes. Instead, you should
simply connect the stream to the Merge node in the normal manner.
Figure B-1
Selecting data mapping options
In contrast to earlier versions of Clementine, data mapping is now tightly integrated
into stream building, and if you try to connect to a node that already has a connection,
you will be offered the option of replacing the connection or mapping to that node.
Mapping Data to a Template
To replace the data source for a template stream with a new source node bringing your
own data into Clementine, you should use the Select Replacement Node option from
the Data Mapping context menu option. This option is available for all nodes except
Merge, Aggregate, and all terminal nodes. Using the data mapping tool to perform this
action helps ensure that fields are matched properly between the existing stream
operations and the new data source. The following steps provide an overview of the
data mapping process.
Step 1: Specify Essential Fields in the original source node. In order for stream operations
to execute properly, essential fields should be specified. In most cases, this step is
completed by the template author. For more information, see "Specifying Essential
Fields" below.
Step 2: Add new data source to the stream canvas. Using one of Clementine’s source
nodes, bring in the new replacement data.
Step 3: Replace the template source node. Using the Data Mapping options on the
context menu for the template source node, choose Select Replacement Node. Then
select the source node for the replacement data.
Figure B-2
Selecting a replacement source node
Step 4: Check mapped fields. In the dialog box that opens, check that the software is
mapping fields properly from the replacement data source to the stream. Any
unmapped essential fields are displayed in red. These fields are used in stream
operations and must be replaced with a similar field in the new data source in order for
downstream operations to function properly. For more information, see "Examining
Mapped Fields" below.
After using the dialog box to ensure that all essential fields are properly mapped, the
old data source is disconnected and the new data source is connected to the template
stream using a Filter node called Map. This Filter node directs the actual mapping of
fields in the stream. An Unmap Filter node is also included on the stream canvas. The
Unmap Filter node can be used to reverse field name mapping by adding it to the
stream. It will undo the mapped fields, but note that you will have to edit any
downstream terminal nodes to reselect the fields and overlays.
Figure B-3
New data source successfully mapped to the template stream
Mapping between Streams
Similar to connecting nodes, this method of data mapping does not require you to set
essential fields beforehand. With this method, you simply connect from one stream to
another using the Data mapping context menu option, Map to. This type of data
mapping is useful for mapping to terminal nodes and copying and pasting between
streams. Note: Using the Map to option, you cannot map to Merge nodes, Append
nodes, or any type of source node.
Figure B-4
Mapping a stream from its Sort node to the Type node of another stream
To map data between streams:
Right-click the node that you want to use for connecting to the new stream.
From the context menu, select:
Data mapping
Map to
Use the cursor to select a destination node on the target stream.
In the dialog box that opens, ensure that fields are properly matched and click OK.
Specifying Essential Fields
When mapping to a Clementine Application Template, essential fields will typically be
specified by the template author. These essential fields indicate whether a particular
field is used in downstream operations. For example, the existing stream may build a
model that uses a field called Churn. In this stream, Churn is an essential field because
you could not build the model without it. Likewise, fields used in manipulation nodes,
such as a Derive node, are necessary to derive the new field. Explicitly setting such
fields as essential helps to ensure that the proper fields in the new source node are
mapped to them. If mandatory fields are not mapped, you will receive an error
message. If you decide that certain manipulations or output nodes are unnecessary,
you can delete the nodes from the stream and remove the appropriate fields from the
Essential Fields list.
Note: In general, template streams in the Solutions Template Library already have
essential fields specified.
To set essential fields:
Right-click on the source node of the template stream that will be replaced.
From the context menu, select Specify Essential Fields.
Figure B-5
Specifying Essential Fields dialog box
Using the Field Chooser, you can add or remove fields from the list. To open the Field
Chooser, click the icon to the right of the fields list.
Examining Mapped Fields
Once you have selected the point at which one data stream or data source will be
mapped to another, a dialog box opens for you to select fields for mapping or to ensure
that the system default mapping is correct. If essential fields have been set for the
stream or data source and they are unmatched, these fields are displayed in red. Any
unmapped fields from the data source will pass through the Filter node unaltered, but
note that you can map non-essential fields as well.
Figure B-6
Selecting fields for mapping
Original. Lists all fields in the template or existing stream--all of the fields that are
present farther downstream. Fields from the new data source will be mapped to these
fields.
Mapped. Lists the fields selected for mapping to template fields. These are the fields
whose names may have to change to match the original fields used in stream
operations. Click in the table cell for a field to activate a drop-down list of available
fields.
If you are unsure of which fields to map, it may be useful to examine the source data
closely before mapping. For example, you can use the Types tab in the source node to
review a summary of the source data.
Index
acquisition, 10
Analysis node, 34
analytical CRM, 9
  benefits of, 10
Apriori, 41
association rules, 48
C&RT, 34
C5.0, 31, 34
CAT
  guidelines for use, 14
  reusing streams, 15
  Telco CAT data, 11
  Telco CAT modules, 12
  Telco CAT overview, 8
  Telco CAT streams, 12
  Telco CAT structure, 11
categorical models, 34
CDR data, 19
churn, 34
churn analysis, 23
churn application
  details, 16
  overview, 12
churn score, 26
Clementine Application Templates (CATs)
  data mapping tool, 60
clustering, 31
  in Module 1, 16
  in Module 2, 37
CRISP-DM, 12
CRM
  analytical, 9
  data mining, 9
  introduction, 9
  life cycle, 9
  operational, 9
cross-sell application
  details, 37
  overview, 12
cross-selling, 10
customer
  acquisition, 10
  retention, 10
d1_churnscore.str, 36
d2_recommend.str, 48
data
  connecting to streams, 15
  merging from both modules, 40
  overview, 11
  train/test sets, 29
data exploration
  in Module 1, 23, 26
data files
  intermediate data files, 53, 57
  list of, 51
data mapping tool, 60, 61
data mining, 9
  benefits of, 10
  stream phases, 12
data preparation
  in Module 1, 19, 21, 29
  in Module 2, 39, 40
data understanding, 41
deployment
  in Module 1, 36
  in Module 2, 48
e1_explore.str, 23
e2_ratios.str, 26
e3_products.str, 41
e4_prodvalue.str, 45
essential fields, 60, 64
evaluation charts, 34
exploration
  in Module 2, 41, 45
field names
  list of, 51
K-means, 31
Kohonen network
  in Module 1, 31
  in Module 2, 43
length of service, 23
logistic regression, 34
m1_churnclust.str, 31
m2_churnpredict.str, 34
m3_prodassoc.str, 43
m4_prodprofile.str, 47
mandatory fields, 65
mapping data, 15, 64
mapping fields, 60
Matrix node, 43
merging data, 40
modeling
  in Module 1, 31, 34
  in Module 2, 43, 47
  train/test sets, 29
models
  categorical, 34
  scoring, 34
Module 1
  details, 16
Module 2
  details, 37
modules
  types of, 12
operational CRM, 9
overview
  Telco CAT, 8
p1_aggregate.str, 19
p2_value.str, 21
p3_split.str, 29
p4_basket.str, 39
p5_custbasket.str, 40
pairwise associations, 41
predictive analysis, 31
ratios, 26
recommendation, 48
retention, 10
reusing streams, 15
scoring models, 34
  in Module 1, 16
  in Module 2, 37
segmentation, 10, 26
solutions template library, 60
source nodes
  data mapping, 61
streams
  details, 16
  guidelines for use, 14
  overview, 12
  reusing with your data, 15
tariff information, 21
telecommunications CAT
  overview, 8
  structure, 11
template fields, 65
templates, 60, 61
test data set, 29
till-roll, 39
train data set, 29
unmapping fields, 60
usage bands, 21
value bands, 45