Data mining

A Best Practices Framework for Data Mining
Mark Tabladillo, Ph.D., Data Mining Scientist
Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor
13 June 2013 | Virtual Business Analytics Chapter
About MarkTab
Training and Consulting with http://marktab.com
Data Mining Resources and Blog at http://marktab.net
Ph.D. – Industrial Engineering, Georgia Tech
Training and consulting internationally across many industries – SAS and Microsoft
Contributed to peer-reviewed research and legislation
◦ Mentoring doctoral dissertations at the accredited University of Phoenix
Presenter
About Artus
Assistant Professor for Analytic Information Systems and Business Intelligence
PhD in computer science
Research: data mining for e-commerce and mobile business
Consultant
Section One
DATA MINING FOUNDATION
Definition 1
(Informal)
Data mining is the automated or semi-automated process
of discovering patterns in data.
Definition 2
Data Mining is a process using
1. Exploratory Data Analysis
Statistical and visual data analysis techniques.
→ Forming a hypothesis
2. Data Modeling & Predictions
Describe data using probability distributions and Machine Learning algorithms (“model”).
→ Fitting a hypothesis
3. Statistical Learning Theory
Model selection, model evaluation
Data Mining Visualized
Target = f(Input)
Target: attribute we are interested in.
Input: data available for our predictions.
Function f: describes the relationship between target and input.
Regrettably, f is unknown and unknowable.
Data Mining Visualized
Real world: Target = f(Input), where f is unknown
Data Mining model: Target ≈ h(Input), where h is a hypothesis
Need to find “good” h.
h is your DM “algorithm”.
Input data has to be appropriate. Select and transform as needed.
Correct modeling of target is crucial.
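A minimal sketch of this picture, under stated assumptions: true_f below is a hypothetical stand-in for the real-world f (which in practice is unknown), the data is simulated, and h is a straight line fitted by ordinary least squares. The names true_f, fit_line and h are illustrative only.

```python
import random

def true_f(x):
    # Hypothetical "real world" relationship; in practice f is unknown.
    return 3.0 * x + 2.0

def fit_line(xs, ys):
    """Fit h(x) = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)
inputs = [i / 10 for i in range(100)]
# Observed targets are f(input) plus noise; only these are visible to us.
targets = [true_f(x) + rng.gauss(0, 0.5) for x in inputs]

slope, intercept = fit_line(inputs, targets)

def h(x):
    """The fitted hypothesis: our approximation of the unknown f."""
    return slope * x + intercept
```

The hypothesis h only approximates f on the range of inputs it was fitted on; a different choice of input or target would give a different h.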
Top 10 Expectations
BEST PRACTICE: LEARN FROM EXPERIENCE
Expectation Ten
Marketing
•People can start data mining in 10 minutes…
More Scientific
•Better models come from days, weeks or months of iterative improvement
Expectation Nine
Marketing
•Data miners can provide provably good models with little or zero knowledge of the specific industry…
More Scientific
•Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.
Expectation Eight
Marketing
•Open source software can provide quality results worthy of peer-reviewed literature…
More Scientific
•Commercial software with years-long service options is required for enterprise scale.
Expectation Seven
Marketing
•We can learn a lot from the current data warehouses, cubes, and big data…
More Scientific
•We can improve our modeling by creating new data collection strategies.
Expectation Six
Marketing
•People can build data mining models with little or zero data cleaning…
More Scientific
•Better results happen when we organize and rearrange data for best success.
Expectation Five
Marketing
•Data mining can provide answers to problems…
More Scientific
•Most times we only get detail insights toward larger problems, and sometimes uncover more problems than we started with.
Expectation Four
Marketing
•A little data mining knowledge can provide an organization with a competitive edge…
More Scientific
•The edge grows along with experience and better study of the methodology and mathematics.
Expectation Three
Marketing
•Individual professionals can deliver excellent predictive analysis…
More Scientific
•Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.
Expectation Two
Marketing
•Numbers speak for themselves and can influence better decision making…
More Scientific
•Leadership strategy helps teams deliver results in the best way given the current culture.
Expectation One
Marketing
•A lot of data mining best practices and strategies can be communicated in an hour or a day…
More Scientific
•The best commitment is ongoing education on both data mining and machine learning technology.
Section Two
ANALYZING AND PREPARING DATA
Best practice: study individual attributes
Histograms and frequencies (discrete)
Kernel density estimates
Cumulative distribution function
Rank-order plots and lift charts
Summary statistics (continuous)
Box-and-whisker plots
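Several of these single-attribute views can be computed before any plotting. The sketch below uses only the Python standard library; the colors and ages data and the helper names histogram and ecdf are hypothetical, chosen for illustration.

```python
from collections import Counter
import statistics

# Hypothetical sample data, for illustration only.
colors = ["red", "blue", "red", "green", "blue", "red"]
ages = [23, 35, 31, 47, 52, 29, 41, 38, 60, 27]

# Frequencies for a discrete attribute.
frequencies = Counter(colors)

def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Summary statistics for a continuous attribute.
summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
}

# Quartiles: the numbers a box-and-whisker plot is drawn from.
q1, q2, q3 = statistics.quantiles(ages, n=4)

def ecdf(values, x):
    """Empirical cumulative distribution: share of values <= x."""
    return sum(v <= x for v in values) / len(values)

age_bins = histogram(ages, bin_width=10)
```

Looking at these numbers first helps decide which plots are worth drawing and whether an attribute needs cleaning or transformation.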
Best practice: study combinations
Pivot tables
Scatter plots
Logarithmic plots
Naïve Bayes
Correlation matrices
False-Color plots
Scatter-Plot matrix
Co-plot
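Two of these combination views, a pivot table and a correlation matrix, can be sketched with the standard library alone. The records and the helper names pivot_sum, pearson and correlation_matrix are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
import math

# Hypothetical records: (region, product, units, price).
records = [
    ("North", "A", 10, 2.0),
    ("North", "B", 4, 5.0),
    ("South", "A", 7, 2.5),
    ("South", "B", 9, 4.0),
    ("North", "A", 6, 2.2),
]

def pivot_sum(rows):
    """Sum units per (region, product), like a small pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for region, product, units, _price in rows:
        table[region][product] += units
    return {region: dict(cols) for region, cols in table.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

pivot = pivot_sum(records)
numeric = {"units": [r[2] for r in records], "price": [r[3] for r in records]}
corr = correlation_matrix(numeric)
```

A correlation matrix only captures linear relationships, which is one reason to pair it with scatter plots.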
Section Three
MACHINE LEARNING ALGORITHMS
How to Choose an Algorithm
Choosing an algorithm or series of algorithms is an art
One algorithm could perform different tasks
Be willing to experiment with algorithms and algorithm parameters
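One way to experiment with algorithm parameters is to score each setting on data held out from training. The sketch below does this for a simple one-dimensional k-nearest-neighbor classifier; the simulated data, the candidate values of k, and the function names are all assumptions made for illustration, not part of the original slides.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D, labels 0/1)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0

def accuracy(train, test, k):
    """Share of held-out points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in test)
    return hits / len(test)

# Simulated two-class data: class 0 centered at 0.0, class 1 centered at 4.0.
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(4.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)
train, holdout = data[:140], data[140:]

# Try several odd values of k (odd avoids tied votes) and keep the best.
results = {k: accuracy(train, holdout, k) for k in (1, 3, 5, 7)}
best_k = max(results, key=results.get)
```

The same loop extends to comparing entirely different algorithms, as long as every candidate is scored on the same held-out data.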
Algorithms for Data Mining Tasks (1 of 2)
Algorithm Name | Description
Microsoft Time Series | Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.
Microsoft Decision Trees | Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.
Microsoft Linear Regression | If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.
Microsoft Clustering | Identifies relationships in a dataset that you might not logically derive through casual observation. Uses iterative techniques to group records into clusters that contain similar characteristics.
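To illustrate what "iterative techniques to group records into clusters" can look like, here is a generic k-means sketch on one-dimensional data. It is not a description of the Microsoft Clustering implementation; the spend data, the starting centers, and the function name kmeans_1d are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

spend = [12, 15, 14, 90, 95, 88, 13, 92]  # hypothetical customer spend
centers, clusters = kmeans_1d(spend, centers=[0, 100])
```

On this data the two centers settle at 13.5 and 91.25, separating low and high spenders. Results generally depend on the starting centers, which is one reason to rerun clustering with different settings.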
Algorithms for Data Mining Tasks (2 of 2)
Algorithm Name | Description
Microsoft Naïve Bayes | Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.
Microsoft Logistic Regression | Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.
Microsoft Neural Network | Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used to classify discrete attributes and regression of continuous attributes.
Microsoft Association Rules | Builds rules that describe which items are likely to appear together in a transaction.
Microsoft Sequence Clustering | Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
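The idea behind rules that describe which items appear together can be sketched with two standard measures, support and confidence. This is a generic illustration, not the Microsoft Association Rules implementation; the transactions, thresholds, and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Share of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

def pair_rules(min_support, min_confidence):
    """List single-item rules (left -> right) that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for left, right in ((a, b), (b, a)):
            lhs, rhs = {left}, {right}
            if (support(lhs | rhs) >= min_support
                    and confidence(lhs, rhs) >= min_confidence):
                rules.append((left, right))
    return rules

rules = pair_rules(min_support=0.4, min_confidence=0.7)
```

Here the rule butter -> bread passes, but bread -> butter does not, because only half of the bread transactions also contain butter.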
Best practice: Document your science
Describe the business problem
Determine how to measure success (including baseline)
Document what was learned during data preparation and analysis
Justify the algorithms used during the investigation
List the assumptions that were made
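Measuring success against a baseline can be as simple as comparing a model with a rule that always predicts the most common outcome. The labels and predictions below are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return the most common training label; the baseline always predicts it."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical churn labels and model predictions, for illustration only.
train_labels = ["stay"] * 8 + ["churn"] * 2
test_labels = ["stay", "stay", "churn", "stay", "churn",
               "stay", "stay", "stay", "churn", "stay"]
model_predictions = ["stay", "stay", "churn", "stay", "stay",
                     "stay", "stay", "stay", "churn", "stay"]

baseline_label = majority_baseline(train_labels)
baseline_accuracy = accuracy([baseline_label] * len(test_labels), test_labels)
model_accuracy = accuracy(model_predictions, test_labels)
improvement = model_accuracy - baseline_accuracy
```

In this made-up case the baseline scores 0.7 and the model 0.9. Recording both numbers makes the reported improvement meaningful to readers of the documentation.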
Section Four
ACHIEVING BUSINESS VALUE
Leadership challenges
Build on organizational communications
Consider redoing analysis
Find results champions
Celebrate the results
Best practice: prepare the next cycle
Note strengths, weaknesses, opportunities, risks
Build consensus on model expiration dates
Encourage and improve the process
Create insight into future data collection
Conclusion
Best Practices Framework
Provide a data mining foundation
Prepare the data
Evaluate machine learning output
Plan to move toward actionable decisions
Resources
http://www.lfd.uci.edu/~gohlke/pythonlibs/ Free Win x64 Python libs
http://www.enthought.com/products/epd.php Commercial Python
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf R Tutorial
http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx SQL Server Analysis Services Data Mining
http://marktab.net Data Mining Portal
http://sqlserverdatamining.com Data Mining Team Portal
Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”