FROM BUSINESS OBJECTIVES TO DATA
MINING: TOWARDS A SYSTEMATIC WAY OF
DATA MINING PROJECT DEVELOPMENT
Facultad de Informática
Ernestina Menasalvas
Universidad Politécnica de Madrid, Spain. emenasalvas@fi.upm.es
November 2004
• 1995: doctoral student.
– Visit University of Regina (Prof. Ziarko)
– Visit Warsaw University (Prof. Pawlak)
• 1998: Defended thesis: Data Mining process model
(Anita Wasilewska & C. Fernandez-Baizan)
• Since then:
– Professor of databases: databases, data mining
– Coordinator of the Data Mining group at Facultad de
Informática UPM
• Techniques: Rough Sets, Bayes, …
• Methodologies for data mining process management
– Evaluation in Data Mining
– Experimentation in Web Mining
• Web Mining: Web Goal Mining
• Projects developed:
– Pure Research:
• Data Mining to be integrated into RDBMS
• Web Profiler
• Methodology for Data Mining process management
– Research and application:
• Data Mining applied to different domains:
– Car dealers
– Travel agency
– ….
• Methodologies for Data Mining project development
– Is Data Mining really a science?
– Are we developing projects as an art?
– Has research achieved the same results in all areas?
• Algorithms
• Data Preparation
• Data enrichment
• Conceptualization of Data Mining problems
• Since data mining appeared, many algorithms have been implemented
• Standards:
– CRISP-DM
– SEMMA
– PMML 3.0
• Process depends on the expertise of the data miner
• User speaks about business problems
• Data Miner speaks about algorithms
• Data Mining is a data-intensive activity
– Data understanding
– Data Preparation
• Database manager:
– Transactional databases
– Data warehouses
• The end result of a data mining project is a tool (a software project) for a better decision-making process:
– Software development project
• IT department has to be involved
• Why?
– In order to organize the development process and to produce a project plan
• How? Lifecycle models
– Establish how the process is going to be developed:
• Sequential
• Incremental
– A way of doing things
– Independent of the process being developed
• What? Methodology
– Establish how the process is split into phases and define the tasks to be carried out in each step:
• RUP
• XP
• CommonKADS
– Particular tasks
– Detail of the tasks to be developed
• The common pitfalls of data mining implementation are the following:
– Not being able to efficiently communicate mining results within an organization.
– Not having the right data to conduct effective analysis.
– Not using existing data correctly.
– Not being able to evaluate results
• Questions that arise:
– Can the adequacy of a data set for a problem be established when preparing the project plan?
– How can the data set be used to produce the expected results?
– How can we evaluate the results?
– Cost estimation?
• Vendor independent:
– CRISP-DM
• Based on the commercial tools:
– CATs
– SEMMA
• CRM Methodology:
– CRM Catalyst
• Remarks:
– Process model
– Not a real methodology
– Based on CRISP-DM
– Global CRM process; does not concentrate on the data mining step
• CATs: Clementine Application Templates [CATs]
– Specific libraries of best practices that provide immediate value right out of the box
– They follow the CRISP-DM standard: every CAT stream is assigned to a CRISP-DM phase
– They provide long-term value, as they can always be used with a new data set for new insight in other projects
• Available as add-on modules to Clementine, they include:
– Telco CAT: improve retention and cross-selling efforts for telecommunications
– CRM CAT: understand and predict customer migration between segments
– Microarray CAT: accelerate biological discoveries, find genes
– Fraud CAT: predict and detect instances of fraud in financial transactions, claims, tax returns …
– Web CAT
• SEMMA (Sample, Explore, Modify, Model, Assess): [SEMMA]
– Is not a data mining methodology
– Rather, it is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining
– Enterprise Miner can be used as part of any iterative data mining methodology adopted by the client
– Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project
• SEMMA is focused on the model development aspects of data mining: [SEMMA]
– Sample the data by extracting a portion of a large data set big enough to contain significant information, yet small enough to manipulate quickly.
– Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas.
– Modify the data by creating, selecting and transforming the variables to focus the model selection process.
– Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. Modelling techniques include neural networks, tree-based classifiers, statistical models, etc.
– Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
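The five SEMMA steps can be illustrated with a minimal sketch in plain Python. This is not how SAS Enterprise Miner implements them: the customer fields (`visits`, `spend`, `churned`), the synthetic data and the single-threshold "model" are all assumptions invented for illustration.

```python
import random
import statistics

def sample(rows, k, seed=0):
    """Sample: draw k rows at random (seeded for reproducibility)."""
    return random.Random(seed).sample(rows, k)

def explore(rows, field):
    """Explore: summary statistics for one numeric field."""
    values = [r[field] for r in rows]
    return {"min": min(values), "max": max(values), "mean": statistics.mean(values)}

def modify(rows):
    """Modify: derive a new variable (spend per visit) from existing ones."""
    return [dict(r, spend_per_visit=r["spend"] / r["visits"]) for r in rows]

def assess(rows, threshold):
    """Assess: accuracy of the rule 'churn if spend_per_visit < threshold'."""
    hits = sum((r["spend_per_visit"] < threshold) == r["churned"] for r in rows)
    return hits / len(rows)

def model(rows):
    """Model: choose the threshold with the best accuracy on the given rows."""
    candidates = sorted({r["spend_per_visit"] for r in rows})
    return max(candidates, key=lambda t: assess(rows, t))

# Synthetic customers; the churn rule (spend per visit below 20) is invented.
rng = random.Random(42)
population = []
for _ in range(1000):
    visits = rng.randint(1, 20)
    spend = visits * rng.uniform(5.0, 50.0)
    population.append({"visits": visits, "spend": spend,
                       "churned": spend / visits < 20.0})

working = modify(sample(population, 200))      # Sample, then Modify
train, holdout = working[:150], working[150:]
summary = explore(train, "spend_per_visit")    # Explore
threshold = model(train)                       # Model
holdout_accuracy = assess(holdout, threshold)  # Assess on unseen rows
```

Assessing on rows that were not used for modelling is a deliberate choice here: it gives a more honest estimate of how well the rule performs than re-using the training rows.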
• CRM Catalyst was developed jointly by CustomISe, MACS and SalesPathways, which together formed the Catalyst Foundation: http://www.crmmethodology.com/
Motivations:
• CRM projects are difficult to execute successfully because of the wide range of factors influencing their success. So it can take a long time to make CRM work properly for an organisation.
• Solution: CRM Catalyst.
• Methodology acts as a catalyst for CRM projects enabling them to achieve their objectives more reliably and in less time.
• It gives a project life cycle with a set of defined phases broken down into steps with clearly stated inputs and outputs.
Data Mining development process:
• The results are obtained in a progressive way: implementation requires a progressive lifecycle model
• Implementation is knowledge intensive in some steps: a knowledge-intensive methodology could be appropriate
1. Define the goals:
– Business and data mining experts together have to define the goals
– Each goal must be defined with measurements for success
2. Obtain the models:
– Apply data mining algorithms.
– Preprocessing is important
3. Evaluate results:
– Ascertain the value of an object according to specified criteria, operationalised in terms of measures.
4. Deploy:
– Decide which patterns and models can be deployed
5. Evaluate:
– Once the product is in operation, its results should be verified
• Distinguish between :
– Data Mining goals
– Business goals
• How do we translate?
– Business goal: increase the lifetime value of valuable customers
– Data mining goal: classification? Estimation? Association?
• This has to be solved in the Business Understanding step of CRISP-DM
Business Understanding: tasks and outputs
• Determine Business Objectives
– Background
– Business Objectives
– Business Success Criteria
• Assess Situation
– Inventory of Resources
– Requirements, Assumptions & Constraints
– Risks & Contingencies
– Terminology
– Costs & Benefits
• Determine Data Mining Goals
– Data Mining Goals
– Data Mining Success Criteria
• Produce Project Plan
– Project Plan
– Initial Assessment of Tools & Techniques
• Not only the business objectives have to be established, but also the measures needed to evaluate the results
• Business objectives:
– What is the customer's primary objective?
• Increase the number of loyal customers
• Selling more of a certain product
• Run a successful marketing campaign
• Business success criteria:
– What constitutes a successful outcome of the project?
– Objective measures, so that success can be established
– ROI
• Perform a cost-benefits analysis
• Compute the benefits of the project
– Which measures do we have?
– ROI
– CAPEX
– OPEX....
• Compute the costs of the project (equipment, human resources...)
– Which methodology do we have?
– COCOMO for software
• Quantify the risk that the project fails
– Knowledge not available
– Data Not available
– Proper tools not available
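As a rough illustration of the cost-benefit analysis above, ROI can be computed as (benefits - costs) / costs and then discounted by the risk that the project fails. The sketch below assumes that a failed project yields no benefit at all; that assumption and every figure are hypothetical, not taken from the talk.

```python
def roi(benefits, costs):
    """Return on investment as a fraction: (benefits - costs) / costs."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) / costs

def expected_roi(benefits, costs, p_failure):
    """ROI when the project fails with probability p_failure.

    Assumption: a failed project yields no benefit but still incurs all costs.
    """
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be between 0 and 1")
    return roi((1.0 - p_failure) * benefits, costs)

# Hypothetical figures, for illustration only.
equipment_cost = 80_000
human_resources_cost = 120_000
costs = equipment_cost + human_resources_cost
benefits = 300_000

plain_roi = roi(benefits, costs)                # ignores the risk of failure
risky_roi = expected_roi(benefits, costs, 0.2)  # 20% chance that the project fails
```

With these invented numbers the ROI drops from about 50% to about 20% once the failure risk is included, which is why the risk has to be quantified alongside costs and benefits.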
• Establishing a parametric estimation model for Data Mining (Marban’03)
• Main factors in a Data Mining project
– Data Sources (number, kind, nature, …)
– Data mining problem to be solved (descriptive, predictive, …)
– Development platform
– Available tools
– Expertise of the development team
• Drivers:
– Data drivers
– Model drivers
– Platform drivers
– Tools and techniques drivers
– Project drivers
– People drivers
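A parametric estimation model combines measured drivers into an effort estimate through fitted coefficients. The slides do not give the form of the model in (Marban’03), so the linear form below is an assumption, and all driver names, weights and values are invented for illustration; only the six driver groups come from the slide.

```python
from dataclasses import dataclass

# The six driver groups named on the slide.
GROUPS = ("data", "model", "platform", "tools and techniques", "project", "people")

@dataclass(frozen=True)
class Driver:
    name: str      # hypothetical driver name
    group: str     # one of GROUPS
    weight: float  # hypothetical coefficient; a real one would be fitted on past projects

    def __post_init__(self):
        if self.group not in GROUPS:
            raise ValueError(f"unknown driver group: {self.group}")

def estimate_effort(intercept, drivers, values):
    """Linear parametric estimate: intercept + sum of weight * driver value."""
    missing = [d.name for d in drivers if d.name not in values]
    if missing:
        raise KeyError(f"missing driver values: {missing}")
    return intercept + sum(d.weight * values[d.name] for d in drivers)

# All names, weights and values below are invented for illustration.
DRIVERS = [
    Driver("number_of_data_sources", "data", 2.0),
    Driver("number_of_models", "model", 3.0),
    Driver("team_inexperience", "people", 1.5),
]
effort = estimate_effort(
    intercept=10.0,
    drivers=DRIVERS,
    values={"number_of_data_sources": 4, "number_of_models": 5, "team_inexperience": 2},
)
```

In practice the coefficients would be obtained by regression over completed projects; the sketch only shows how the driver groups feed a single estimate.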
• Data mining goals:
– Translate the customer's primary objective into a data mining goal, e.g.
• A loyalty program translated into a segmentation problem
• Decreasing the attrition rate transformed into a classification problem
• Data mining success criteria:
– Determine success in technical terms
• Translate the notion of success into confidence, support, lift and other parameters
• Determine the cost of errors
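To make the technical success criteria concrete, the sketch below computes support, confidence and lift for an association rule "antecedent → consequent" over a list of transactions, using their standard definitions. The basket data are invented for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)
    lift       = confidence / P(consequent)
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_consequent = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    if n == 0 or n_antecedent == 0 or n_consequent == 0:
        raise ValueError("rule is undefined on these transactions")
    support = n_both / n
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n)
    return support, confidence, lift

# Invented baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"milk"})
```

A lift below 1, as in this toy example, means that the antecedent makes the consequent less likely than it is overall, so a success criterion would typically require a lift clearly above 1 in addition to minimum support and confidence.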
• How do we make the translation?
• What methodology should be followed to translate business objectives into data mining objectives?
• Unfortunately, there is no such methodology. First we have to solve:
– How is a business objective expressed?
– What is a data mining goal?
– How are data mining goals achieved?
– What are the requirements of data mining functions?
In order to describe everything in a standard way:
• Databases:
– E/R diagrams
– Independent of the domain
– A tool for business understanding and for the database designer
– Translation from E/R to implementation
• External view 1, …, external view n
• Conceptual schema
• Internal schema
• External views: business problems
• Conceptual schema: the requirements of the algorithms will be solved at this level
• Internal schema: tool requirements to be solved (SAS, WEKA, Clementine…)
• The conceptual schema is the bridge:
– Between business goals and the final tool
– Independent of the domain
• Provides independence:
– Changes in the tool do not affect the solution
• It has to be decided what to model in the conceptualization
• Automatic translation of business goals into data mining goals
• Data Mining goals + constraints = feasible data mining goals
• Elements to be taken into account:
– Data:
• Quality from data mining point of view
• Adequacy for the problem
• Classification for data mining purposes
– Knowledge:
• Related to the process being analyzed
• Related to the data used
– People
• Owners of data
• Experts in the process
– Data mining problems requirements
– Data mining methods requirements
• Data Mining Modelling Objects:
– Data
– Knowledge
– Constraints of data and applications
– Data Mining objects
• Algorithms
• Measures
• Methods
• To bridge the gap between data miners and business users
• The adequacy of the data is analyzed taking into account the goals to be fulfilled.
• Data, together with the knowledge extracted from the experts, can be transformed so that, simply by serving as input to a certain data mining algorithm, it produces the required patterns.
• Quality of the data, in this context:
– is not only related to technical quality (proper model, percentage of null values),
• but also has to do with:
– the meaning of the attributes,
– where each piece of data comes from,
– the relationships among the data, and
– finally, how the data fulfil the requirements of the data mining functions
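Both sides of data quality can be checked mechanically to some extent. The sketch below computes the percentage of null values per attribute (technical quality) and lists attributes whose meaning or origin is not documented. The `catalogue` structure, its `meaning` and `source` keys, and the sample rows are hypothetical.

```python
def null_percentage(rows, attributes):
    """Percentage of missing (None) values for each attribute."""
    if not rows:
        raise ValueError("no rows to analyse")
    return {
        a: 100.0 * sum(r.get(a) is None for r in rows) / len(rows)
        for a in attributes
    }

def undocumented(attributes, catalogue):
    """Attributes lacking a recorded meaning or origin in the catalogue."""
    return [
        a for a in attributes
        if not catalogue.get(a, {}).get("meaning")
        or not catalogue.get(a, {}).get("source")
    ]

# Invented rows and metadata, for illustration only.
rows = [
    {"age": 34, "income": None},
    {"age": None, "income": 2100},
    {"age": 51, "income": None},
    {"age": 28, "income": 1800},
]
catalogue = {
    "age": {"meaning": "customer age in years", "source": "CRM database"},
    "income": {"meaning": "monthly income"},  # origin not recorded
}
nulls = null_percentage(rows, ["age", "income"])
needs_documentation = undocumented(["age", "income"], catalogue)
```

Checks like these only flag candidates; whether an attribute is adequate for a given goal still depends on the requirements of the data mining function it will feed.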
• Apply data mining process model
• Associated problems solved by the three-layer architecture:
– Comparison of approaches
– Evaluate costs
– Pros and cons of approaches
• Only experience or a conceptualization can help
• The conceptual model will help to establish the process to obtain each feasible model.
• Requirements and transformations implicit in the model
– What are data mining problems?
• Classification
• Estimation
• Association
• Segmentation
– In the conceptual model, the requirements for each type will be settled
– The data mining problem has to be settled before going into the modeling step
– Requirements will be established in Business Understanding
– Requirements will be checked in Data Understanding and Data Preparation
– Preparation will be guided by the conceptual model
– Evaluation of feasibility can be done before applying the model
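The idea that requirements per problem type can be checked before modelling can be sketched as a simple feasibility filter: candidate data mining goals are kept only if the available data meet the requirements of their problem type. The requirements encoded below (e.g. that classification needs a categorical target) and the shape of the data profile are simplifying assumptions, not the conceptual model proposed in the talk.

```python
# Assumed requirements per problem type; a real conceptual model would be richer.
REQUIREMENTS = {
    "classification": lambda d: d.get("target_type") == "categorical",
    "estimation": lambda d: d.get("target_type") == "numeric",
    "association": lambda d: d.get("transactional", False),
    "segmentation": lambda d: d.get("n_attributes", 0) >= 2,
}

def feasible_goals(candidates, data_profile):
    """Keep the candidate problem types whose requirements the data satisfy."""
    unknown = [g for g in candidates if g not in REQUIREMENTS]
    if unknown:
        raise ValueError(f"unknown problem types: {unknown}")
    return [g for g in candidates if REQUIREMENTS[g](data_profile)]

# Hypothetical profile of the available data.
profile = {"target_type": "categorical", "transactional": False, "n_attributes": 12}
goals = feasible_goals(
    ["classification", "estimation", "association", "segmentation"], profile
)
```

Running such a check during Business Understanding means that infeasible goals are discarded before any effort is spent on data preparation.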
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
[Spiliopoulou, Berendt]
• Evaluation: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures.
• Object = model already obtained
• Criteria and measures have to do with the goals
• Evaluation requires a well-defined notion of success, which must be in place before
– the evaluation takes place
– the data mining phase starts
– any work with the data starts
• i.e. already during the business understanding process.
• Here once again conceptualization plays its role
• The CRISP-DM process is
– a non-ending circle of iterations
– a non-sequential process, where backtracking to previous phases is usually necessary
• In each sequential instantiation, evaluation takes place:
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
• But it is a cycle
• In all the iterations all the steps should be revisited
• Results have to be evaluated!
• All the models that have a positive evaluation can be deployed
• For the measurements of success to be trusted, deployment has to follow the rules established at the beginning of the project
– The real evaluation has not yet been performed
• After deployment, there is the need to prove that the improvements are really due to the actions taken after a data mining discovery and not to any other factor or action carried out in the company
• None of the obvious claims about success of data mining have ever been systematically tested.
• Experiments are crucial to establish if the impact of the deployment is really positive or negative
• Experiments have to be designed at the beginning of the project
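One common way to run such an experiment is a controlled comparison: one group of customers receives the action suggested by the data mining result and a control group does not. The slides do not prescribe a statistical test; the two-proportion z-test below is a standard choice swapped in for illustration, and the counts are invented.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value) using the pooled estimate of the common proportion.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("no variation in the outcomes")
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: purchases among treated customers and among the control group.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
```

Group sizes and the significance threshold would have to be fixed when the experiment is designed, that is, at the beginning of the project, so that the outcome cannot be tuned after the fact.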
• Data mining projects are being developed more as an art than as a science
• Many algorithms have been implemented, but after deployment there is no systematic proof, on real cases, that one is better than another
• Conceptual model is required:
– To map business goals to the model
– To map data mining algorithms to a conceptual model
• Achievements of the model:
– Will be used along the process to guide the project
– Evaluation tool
• Conceptual model
– Define DMMO objects
• Evaluation techniques related to the model:
– Evaluate data mining goals
– Evaluate business goals
• Experimentation methods:
– obtrusive and
– non-obtrusive
• Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas: Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy, 20th September 2004.
• Menasalvas, Millán, Gonzalez-Aranda, Segovia: Towards a Methodology for Data Mining Project Development: The Importance of Abstraction.
• Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme: Web Mining: From Web to Semantic Web, First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Springer 2004.
• Myra Spiliopoulou, Carsten Pohle: Modelling and Incorporating Background Knowledge in the Web Mining Process. Pattern Detection and Discovery 2002: 154-169.
• www.crisp-dm.org
• www.spss.com/clementine/cats.htm
• www.sas.com/technologies/analytics/datamining/miner/semma.html
• www.crmmethodology.com
• www.emetrics.org/articles/whitepaper.html