Predictive Analytics - Enterprise Computing Community

Business Analytics
Predictive Analytics:
No Crystal Ball Required
Steve Barbee, MS Data Mining, MS Plasma Physics
IBM SPSS Predictive Analytics Specialist
June 15, 2010
© 2010 IBM Corporation
Contents
• What is Predictive Analytics?
– Right Time, High Priority
– Definitions
– Disciplines
– vs. Statistics
– Datasets
– vs. BI Methods
• What Does It Do, Where Is It Applied?
– Questions It Answers
– Application Areas
– IBM’s Large Investment
• How Does It Work?
– Modeler Data Mining Workbench
– Mining Methods
– Text Mining
– Training a Learning Machine
– Breadth of Data
– Scoring Large Datasets
• How Do You Teach It?
– Hot Jobs
– Disciplines
– Curriculum
– Textbooks
The Time is Still Right for Analytics
• Executives are looking for new sources of advantage and differentiation
• They have more data about their businesses than ever before
• A new generation of technically literate executives is coming into organizations
• The ability to make sense of data through computers and software has finally come of age
Tom Davenport & Jeanne Harris, Competing on Analytics, p.11
BI/Analytics: #1 investment to improve competitiveness
Top Four of the Ten Most Important Visionary Plan Elements (interviewed CIOs could select as many as they wanted)
Source: IBM Global CIO Study 2009; n = 2345
What is Data Mining?
• “…the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules” -- Berry & Linoff*
• “…the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” -- Gartner Group
• “Predictive analytics is a set of business intelligence technologies that uncovers relationships and patterns within large volumes of data that can be used to predict behavior and events.” -- TDWI Research**
* From Data Mining Techniques: For Marketing, Sales & Customer Support, Michael J.A. Berry & Gordon Linoff, p.5
** “Predictive Analytics,” What Works in Data Integration, TDWI Research, Vol.23, 2007, p.49
Some Fields Contributing To Data Mining
• Databases
– Batch & OLAP reports
– Relational Data Model
– Data Warehousing
– Association Rules
• Machine Learning / Artificial Intelligence
– Neural Networks
– ML Perceptron
– Genetic Algorithm
– Kohonen SOM
– Decision Tree
• Information Retrieval
– Similarity Measures
– Clustering
– SMART IR systems
• Statistics
– Bayes (Naïve & Nets)
– Regression analysis
– Linear classification
– EM algorithm
– Maximum Likelihood Estimate
– Resampling, Jackknife, Bias reduction
– Exploratory data analysis
– K-Means clustering
Based on Data Mining: Intro. & Adv. Topics, Margaret H. Dunham, p.13
Range of Records and Variables in Data Mining
[Figure: chart of datasets by size. Axes: Common Logarithm of Number of Records (0 to 10) and Common Logarithm of Number of Variables (0 to 7). Labeled regions: "Narrow & Deep" and "Wide & Shallow". Labeled examples: Retail sales, Semiconductor Manufacturing, Proteomics, Genomics.]
Modified from S. Barbee thesis: http://web.ccsu.edu/datamining/data%20mining%20theses/steve%20barbee%20thesis1905.pdf
Time To Change the 2 Cultures* Clash
Top-Down Approaches: Query, Search
Bottom-Up Approaches: Data Mining, Text Mining
• A statistical approach can involve a user forming a theory about a possible relationship in a database, converting that theory to a hypothesis, and testing the hypothesis with a statistical method. It is a manual, user-driven, top-down approach to data analysis.
• The difference with data mining (which includes multivariate statistical models!) is that the interrogation of the data is done by the data mining method rather than by the user. It is a data-driven, self-organizing, bottom-up approach to data analysis.
Source: DM Review
Statisticians can use their favorite methods from within Modeler 14, and Data Miners can broaden their capabilities by invoking statistical methods from Statistics 18.
* "Statistical Modeling: The Two Cultures," Leo Breiman, Statistical Science, 2001, Vol.16 (3), pp.199-231.
The Kinds of Questions that Data Mining Can Answer
• Based on the percussion beat, what genre of music is this?
• Which books of the New Testament have the same author?
• What class of astronomical object is this image?
• Which genes express when drug B prevents the rejection of a transplanted organ?
• Which transformer in a grid is likely to fail due to a breakdown of its dielectric?
• What combination of repair parts is needed at worldwide aircraft service centers?
• To which of 4 products will a customer respond in a marketing campaign?
• How much of a costume should store # 7005 stock for Halloween this year?
• Which annuity holder will prematurely surrender their policy?
• Which physician will prescribe more of this acid reflux drug than an alternative?
Application Areas
Neonatal Care
Trading Advantage
Environment
Law Enforcement
Radio Astronomy
Telecom
Manufacturing
Smart Traffic
Fraud Prevention
IBM is Investing to Accelerate an Information-Led Transformation
• Over $12B in software investments since 2005
• Over 4,000 Dedicated Consultants
• Analytics in a Box to Accelerate Time to Value
• Largest Math Department in Private Industry
“IBM, not SAP or Oracle, is now the industry's premo analytics solution/platform vendor…”
Some Business Analytics Methods Compared
Query/Reporting
• Hypothesis-driven
• Manual
• Example: ‘Which training regimen increases the lactate threshold the most?’
OLAP
• Hypothesis-driven
• Manual
• Example: ‘Drill down Training = 5 and Diet = 4 and VO2 = 9th decile’
Data Mining
• Data- & Goal-driven
• Creates Hypotheses
• Automatic
• Example: Rule 3 for ‘Athlete Qualified’: VO2 Max > 5th decile and Interval Training Regimen in {15, 7-10} results in 100% Qualified for 83 athletes
Outputs: Reports & Graphs; Scoring Model
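The data mining rule on this slide can be read as a filter over athlete records plus a count of how often the outcome holds. A minimal sketch in Python, assuming hypothetical field names (`vo2_decile`, `regimen`, `qualified`) and made-up records; reading "{15, 7-10}" as regimen code 15 or codes 7 through 10 is also an assumption:

```python
# Hypothetical athlete records; field names and values are illustrative only.
athletes = [
    {"vo2_decile": 9, "regimen": 15, "qualified": True},
    {"vo2_decile": 6, "regimen": 8, "qualified": True},
    {"vo2_decile": 4, "regimen": 15, "qualified": False},
    {"vo2_decile": 7, "regimen": 3, "qualified": False},
]

# Assumed reading of "{15, 7-10}": regimen code 15, or any code from 7 to 10.
RULE_REGIMENS = {15} | set(range(7, 11))


def rule_matches(record):
    """Rule condition: VO2 max above the 5th decile and a listed regimen."""
    return record["vo2_decile"] > 5 and record["regimen"] in RULE_REGIMENS


def rule_support_and_confidence(records):
    """Return (records covered by the rule, fraction of those that qualified)."""
    covered = [r for r in records if rule_matches(r)]
    if not covered:
        return 0, 0.0
    hits = sum(1 for r in covered if r["qualified"])
    return len(covered), hits / len(covered)


support, confidence = rule_support_and_confidence(athletes)
print(support, confidence)
```

The two numbers returned correspond to what the slide reports for Rule 3: how many athletes the rule covers (83 on the slide) and what share of them qualified (100%).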
IBM Analytics Landscape
[Figure: analytics methods plotted on axes labeled Competitive Edge and Complexity. Listed from the top: Optimization; Predictive Analytics; Simulation, Alerts; Querying, Reporting, OLAP.]
Based on: Competing on Analytics, Davenport and Harris, 2007
IBM SPSS Product Areas
SPSS Modeler Capabilities
• Easy to Learn / Visual Design Paradigm
• Visual approach - no writing code!
• Comprehensive range of data mining methods
• Powerful Automated modeling
• Automatically prepares data
• Automatically finds the best model
• Mines text, web & survey data
• Fully integrated with Statistics
• Open & Scalable architecture
• No proprietary database required
• Leverage your existing IT investment
• Scales to enterprise volumes with SQL pushback, in-database scoring
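One way to picture in-database scoring is to express a model's rule as SQL so that the database scores rows without exporting them. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `customers` table and a hand-written rule; the table, columns, and thresholds are assumptions, and it does not reproduce the SQL that Modeler itself generates.

```python
import sqlite3

# In-memory database standing in for an existing enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, neg_billing_calls INTEGER, tenure_months INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 0, 48), (2, 3, 6), (3, 1, 30), (4, 4, 60)],
)

# A hypothetical tree-style rule written as a CASE expression,
# so the scoring logic runs inside the database engine.
SCORING_SQL = """
SELECT id,
       CASE
           WHEN neg_billing_calls >= 3 AND tenure_months < 12 THEN 'high'
           WHEN neg_billing_calls >= 3 THEN 'medium'
           ELSE 'low'
       END AS churn_risk
FROM customers
ORDER BY id
"""

scores = conn.execute(SCORING_SQL).fetchall()
for row in scores:
    print(row)
conn.close()
```

Only the scored result set leaves the database, which is the point of pushing the work to where the data already lives.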
Mining Methods in IBM SPSS Modeler 14
Data Preparation
• Dimension Reduction:
– Feature Selection
– Principal Components Analysis
– Factor Analysis
Classification and Regression
• Naïve Bayes
• Bayesian Networks
• Trees:
– CHAID
– C5.0
– C&RT
– QUEST
• Neural Networks
– Multi-Layer Perceptron
– Radial Basis Functions
• Regression
– Binomial, Multinomial Logistic
– Multiple, Multivariate Linear
• Generalized Linear Model
• Discriminant Analysis
• SVM (Support Vector Machine)
Segmentation and Anomaly Detection
• Clustering:
– K-Means
– Kohonen Self-Organizing Maps
– 2-Step (based on BIRCH)
Forecasting & Survival Analysis
• Time Series (ARIMA**)
• Cox Regression
Market Basket & Sequence Analysis
• Association Rules:
– Apriori
– GRI
– CARMA
Case-Based Reasoning
• KNN – K Nearest Neighbor
Getting Closer to 360-degree Customer View:
Demographics Data
Web Data
Text Mining: Comments
Customer
Usage Data
Predict: SPSS Text Analytics
• Leverages unstructured data via call center notes, blogs, web pages, open-ended surveys, etc. to improve predictive model accuracy
• Extracts concepts from text and can categorize them as sentiments
• Strong visualization capabilities enable quick understanding of business issues
Classification and Regression Require a Target Field
Text Analytics adds columns such as the number of calls categorized as a Negative Billing Sentiment (Neg Billg)
Inputs and a Target
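How a text-derived column becomes a model input can be sketched in a few lines of Python. The call records, category labels, and names used here (`calls`, `neg_billing`, `churned`) are hypothetical, and the categorization is assumed to have been done already by a text mining step:

```python
from collections import Counter

# Hypothetical call-center records: (customer_id, category assigned by text mining).
calls = [
    (101, "Negative Billing Sentiment"),
    (101, "Positive Service Sentiment"),
    (101, "Negative Billing Sentiment"),
    (102, "Positive Service Sentiment"),
    (103, "Negative Billing Sentiment"),
]

# Hypothetical structured data: one row per customer with a known target value.
customers = [
    {"id": 101, "tenure_months": 8, "churned": True},
    {"id": 102, "tenure_months": 40, "churned": False},
    {"id": 103, "tenure_months": 22, "churned": False},
]

# Count calls per customer that fall in the category of interest.
neg_billing_counts = Counter(
    cid for cid, category in calls if category == "Negative Billing Sentiment"
)

# Add the count as a new input column; customers with no such calls get 0.
rows = [dict(c, neg_billing=neg_billing_counts.get(c["id"], 0)) for c in customers]

# Inputs are the predictor columns; the target is the field the model learns to predict.
inputs = [(r["tenure_months"], r["neg_billing"]) for r in rows]
target = [r["churned"] for r in rows]
print(inputs)
print(target)
```

Classification and regression methods need both pieces: the input columns (including the new text-derived count) and a target field with known outcomes.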
Mining Methods “Learn” from Data
Data sources:
• Customer Notes → Text Mining (Category = T or F)
• Customer Database → Survey/demographic (Satisfaction = 1-4)
• Web page hits → Web Mining (Event = Y or N)
Merged Data is split 2/3 and 1/3 into Data To Train and Data To Test Model
Data To Train → Learning method → Predictive Model
New Data → Predictive Model → Scored Predictions
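The flow on this slide (merge, split, train, test, score) can be sketched in plain Python. Everything concrete here is an assumption for illustration: the records are made up, the 2/3 portion is taken to be the training set, and the "learning method" is a deliberately simple one-rule classifier rather than any method from Modeler.

```python
import random

# Hypothetical merged records: (satisfaction 1-4, web_event 0/1, negative_note 0/1, churned).
merged = [
    (1, 1, 1, True), (1, 0, 1, True), (2, 1, 1, True), (2, 1, 0, True),
    (2, 0, 1, True), (1, 1, 0, True), (3, 0, 0, False), (4, 0, 0, False),
    (3, 1, 0, False), (4, 1, 0, False), (3, 0, 1, False), (4, 0, 1, False),
]


def split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_stump(train):
    """Learn a one-rule model of the form: predict True when record[field] <= threshold.

    Tries every input field and every observed value as a threshold and keeps
    the pair with the most correct predictions on the training data.
    """
    best = None
    for field in range(3):
        for threshold in sorted({r[field] for r in train}):
            correct = sum((r[field] <= threshold) == r[3] for r in train)
            if best is None or correct > best[0]:
                best = (correct, field, threshold)
    _, field, threshold = best
    return field, threshold


def predict(model, record):
    """Apply the learned rule to one record (labeled or unlabeled)."""
    field, threshold = model
    return record[field] <= threshold


train, test = split(merged)
model = train_stump(train)

# Held-out test data estimates how well the model generalizes.
accuracy = sum(predict(model, r) == r[3] for r in test) / len(test)

# New, unlabeled data gets scored with the trained model.
new_data = [(2, 0, 0), (4, 1, 1)]
scored = [predict(model, r) for r in new_data]
print(len(train), len(test), model, accuracy, scored)
```

Keeping the test records out of training is what makes the accuracy figure a fair estimate of performance on new data.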
Steps in the Data Mining Process
Understand
• Connect to data sources: Transactions, 3rd Party, Surveys
• Actions, Attitudes, Attributes
• Data exploration
Prepare
• Parse Trx by Mo.
• Aggregate call data
• Merge (plan & ID)
• Subdivide by region, plans, etc.
• Anomaly detection
Model
• Define Target & Train Method
• Transform log Trx
• Binary, hi trend
• Feature selection
• Trees, Neural Networks, Regressions, SVM, Bayesian Network
Evaluate
• Test Method
• Gains, accuracy, AUROC, Profit, Contingency matrix
Deploy
• Predict on new data
• Export Results, Model
• Sales strategy
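Three of the evaluation measures named on this slide (contingency matrix, accuracy, AUROC) can be computed directly from test-set outcomes. A minimal sketch in Python with made-up outcomes and scores; the 0.5 cut-off and the function names are assumptions for illustration:

```python
def contingency_matrix(actual, predicted):
    """Return counts as (true_pos, false_pos, false_neg, true_neg)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


def auroc(actual, scores):
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs in which the positive case has the higher score; ties count one half."""
    pos = [s for a, s in zip(actual, scores) if a]
    neg = [s for a, s in zip(actual, scores) if not a]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set outcomes and model scores.
actual = [True, True, True, False, False, False, False, True]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
predicted = [s >= 0.5 for s in scores]  # assumed cut-off of 0.5

tp, fp, fn, tn = contingency_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn, accuracy, auroc(actual, scores))
```

Accuracy depends on the chosen cut-off, while AUROC summarizes how well the scores rank positives above negatives across all cut-offs.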
Automated Data Mining Scoring Process
1. Build a Geographic Crime Predictive Model
2. Score the Model on New Data in Your Database
3. Deploy a Map of Hot Spots in the Field
© 2009 SPSS Inc.
Should I Teach Data Mining Skills in My Department?
“In addition, as the U.S. business environment becomes increasingly competitive and organizations strive to increase efficiency and reduce costs through the use of information technology, computer and mathematical science occupations will see strong employment growth.” -- 2008-2018 Outlook in Monthly Labor Review, Nov. 2009, p.83
Hot Careers for College Graduates 2010
A Special Report for Recent and Mid-Career College Graduates
UC San Diego Extension, May 2010
1. Health Information Technology
2. Clinical Trials Design and Management for Oncology
3. Data Mining
4. Embedded Engineering
5. Feature Writing for the Web
6. Geriatric Health Care
7. Mobile Media
8. Occupational Health and Safety
9. Spanish/English Translation and Interpretation
10. Sustainable Business Practices and the Greening of all Jobs
11. Teaching Adult Learners
12. Teaching English as a Foreign Language
13. Marine Biodiversity and Conservation
14. Health Law
A Sampling of Academic Disciplines Impacted by Data Mining – A Method of Obtaining Knowledge Empirically
• Arts
– Music
– Language, Linguistics
– Writing / Communications
• Political Science / Government
– Crime
– Public Safety
– Election Campaigning
• Law
– Tax Fraud
– Legal Documents
• Education
– Admissions
– Retention
– Performance
• Physical Education
– Athletic Performance
• Engineering Management
– Utilities
– Petrochemical
– Yield & Reliability
• Science
– Astronomy
– Material Science
• Medicine
– Genomic and Proteomic Analysis
– Biomarkers
– Diagnosis
How Do You Teach It?
I. Foundations
1. Intro
2. Data Preprocessing
3. Data Warehousing and OLAP for Data Mining
4. Association, correlation and frequent pattern analysis
5. Classification
6. Cluster and Outlier Analysis
7. Mining Time-Series and Sequence Data
8. Text Mining and Web Mining
9. Visual Data Mining
10. Data Mining: Industry efforts and social impacts
II. Advanced Topics
1. Advanced Data Preprocessing
2. Data Warehousing, OLAP, Data Generalization
3. Advanced association, correlation and frequent pattern analysis
4. Advanced Classification
5. Advanced cluster analysis
6. Advanced Time-Series and Sequential Data Mining
7. Mining Data Streams
8. Mining Spatial, Spatiotemporal and Multimedia data
9. Mining Biological Data
10. Text Mining
11. Hypertext and Web mining
12. Data Mining Languages
13. Data Mining Applications
14. Data Mining and Society
15. Trends in Data Mining
Source: http://www.sigkdd.org/curriculum/CURMay06.pdf
Textbooks
[Figure: textbooks arranged by DIFFICULTY and by orientation (Statistical, Business, Machine Learning, Practical S/W apps.). Authors shown: Hastie, Tibshirani & Friedman; Han, Kamber & Pei; Nisbet, Elder & Miner; Mitchell; Witten & Frank; Larose; Tan, Steinbach & Kumar; Margaret Dunham; Berry & Linoff.]
For a copy of the presentation please e-mail:
sbarbee@us.ibm.com