Business Analytics
Predictive Analytics: No Crystal Ball Required
Steve Barbee, MS Data Mining, MS Plasma Physics
IBM SPSS Predictive Analytics Specialist
June 15, 2010
© 2010 IBM Corporation

Contents
What is Predictive Analytics?
– Right Time, High Priority
– Definitions
– Disciplines
– vs. Statistics
– Datasets
– vs. BI Methods
What Does It Do, Where Is It Applied?
– Questions It Answers
– Application Areas
– IBM’s Large Investment
How Does It Work?
– Modeler Data Mining Workbench
– Mining Methods
– Text Mining
– Training a Learning Machine
– Breadth of Data
– Scoring Large Datasets
How Do You Teach It?
– Hot Jobs
– Disciplines
– Curriculum
– Textbooks

The Time is Still Right for Analytics
• Executives are looking for new sources of advantage and differentiation
• They have more data about their businesses than ever before
• A new generation of technically literate executives is coming into organizations
• The ability to make sense of data through computers and software has finally come of age
-- Tom Davenport & Jeanne Harris, Competing on Analytics, p.11

BI/Analytics #1 investment to improve competitiveness
Top Four of the Ten Most Important Visionary Plan Elements (interviewed CIOs could select as many as they wanted)
Source: IBM Global CIO Study 2009; n = 2345

What is Data Mining?
“…the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules”
-- Berry & Linoff*

“…the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”
-- Gartner Group

“Predictive analytics is a set of business intelligence technologies that uncovers relationships and patterns within large volumes of data that can be used to predict behavior and events.”
-- TDWI Research**

* From Data Mining Techniques: For Marketing, Sales & Customer Support, Michael J.A. Berry & Gordon Linoff, p.5
** “Predictive Analytics,” What Works in Data Integration, TDWI Research, Vol.23, 2007, p.49

Some Fields Contributing To Data Mining
• Databases: Batch & OLAP reports; Relational Data Model; Data Warehousing; Association Rules
• Machine Learning / Artificial Intelligence: Neural Networks; ML Perceptron; Genetic Algorithm; Kohonen SOM; Decision Tree
• Information Retrieval: Similarity Measures; Clustering; SMART IR systems
• Statistics: Bayes (Naïve & Nets); Regression analysis; Linear classification; EM algorithm; Maximum Likelihood Estimate; Resampling, Jackknife, Bias reduction; Exploratory data analysis; K-Means clustering
Based on Data Mining: Introductory and Advanced Topics, Margaret H. Dunham, p.13

Range of Records and Variables in Data Mining
[Figure: Common Logarithm of Number of Records (0 to 10) plotted against Common Logarithm of Number of Variables (0 to 7). Labels: Narrow & Deep; Retail sales; Semiconductor Manufacturing; Wide & Shallow; Proteomics; Genomics.]
Modified from S.
Barbee thesis: http://web.ccsu.edu/datamining/data%20mining%20theses/steve%20barbee%20thesis1905.pdf

Time To Change the 2 Cultures* Clash
Top-Down Approaches: Query, Search
Bottom-Up Approaches: Data Mining, Text Mining
A Statistical Approach can involve a user forming a theory about a possible relationship in a database and converting that to a hypothesis and testing that hypothesis using a statistical method. It is a manual, user-driven, top-down approach to data analysis. The difference with data mining (which includes multivariate statistical models!) is that the interrogation of the data is done by the data mining method--rather than by the user. It is a data-driven, self-organizing, bottom-up approach to data analysis.
Source: DM Review
Statisticians can use their favorite methods from within Modeler 14, and Data Miners can broaden their capabilities by invoking statistical methods from Statistics 18.
* "Statistical Modeling: The Two Cultures," Leo Breiman, Statistical Science, 2001, Vol.16 (3), pp.199-231.

The Kinds of Questions that Data Mining Can Answer
• Based on the percussion beat, what genre of music is this?
• Which books of the New Testament have the same author?
• What class of astronomical object is this image?
• Which genes express when drug B prevents the rejection of a transplanted organ?
• Which transformer in a grid is likely to fail due to a breakdown of its dielectric?
• What combination of repair parts is needed at worldwide aircraft service centers?
• To which of 4 products will a customer respond in a marketing campaign?
• How many units of a costume should store #7005 stock for Halloween this year?
• Which annuity holder will prematurely surrender their policy?
• Which physician will prescribe more of this acid reflux drug than an alternative?
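
The contrast between the top-down and bottom-up approaches can be made concrete with a small sketch. Everything here is an illustrative assumption rather than material from the presentation: the toy records, the field names, and the simple single-threshold rule search, which stands in for a real data mining method. In the top-down function the analyst picks one hypothesis and tests it; in the bottom-up function the method itself tries every field and threshold and reports the rule that fits best.

```python
# Hypothetical toy records: (age, monthly_calls, tenure_years, churned).
RECORDS = [
    (25, 12, 1, True),
    (31, 10, 2, True),
    (47, 3, 8, False),
    (52, 2, 10, False),
    (38, 11, 1, True),
    (29, 4, 6, False),
    (60, 9, 3, True),
    (44, 1, 7, False),
]
FIELDS = ("age", "monthly_calls", "tenure_years")


def rule_accuracy(records, field_index, threshold):
    """Fraction of records where (value > threshold) matches the target."""
    hits = sum((rec[field_index] > threshold) == rec[-1] for rec in records)
    return hits / len(records)


def top_down(records):
    """User-driven: test one hypothesis the analyst chose in advance."""
    # Assumed hypothesis for illustration: customers older than 40 churn.
    return rule_accuracy(records, FIELDS.index("age"), 40)


def bottom_up(records):
    """Data-driven: let the method try every field, threshold and direction."""
    best = None
    for i, name in enumerate(FIELDS):
        for threshold in sorted({rec[i] for rec in records}):
            for direction in (">", "<="):
                acc = rule_accuracy(records, i, threshold)
                if direction == "<=":
                    # The opposite rule is right exactly where this one is wrong.
                    acc = 1 - acc
                if best is None or acc > best[0]:
                    best = (acc, name, direction, threshold)
    return best


print("Analyst's hypothesis accuracy:", top_down(RECORDS))
print("Best rule found by search:", bottom_up(RECORDS))
```

On this toy data the analyst's guess about age fits poorly, while the search surfaces a rule on a field the analyst never asked about. Real data mining methods search far richer model spaces and guard against overfitting, which this sketch does not attempt.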
Application Areas
• Neonatal Care
• Trading Advantage
• Environment
• Law Enforcement
• Radio Astronomy
• Telecom
• Manufacturing
• Smart Traffic
• Fraud Prevention

IBM is Investing to Accelerate an Information-Led Transformation
• Over $12B in software investments since 2005
• Over 4,000 Dedicated Consultants
• Analytics in a Box to Accelerate Time to Value
• Largest Math Department in Private Industry
“IBM, not SAP or Oracle, is now the industry's premo analytics solution/platform vendor…”

Some Business Analytics Methods Compared
Query/Reporting
• Hypothesis-driven
• Manual
OLAP
• Hypothesis-driven
• Manual
Data Mining
• Data- & Goal-driven
• Creates Hypotheses
• Automatic Training
[Figure labels: Query: ‘Which training regimen increases the lactate threshold the most?’; OLAP: ‘Drill down Training = 5 and Diet = 4 and VO2 = 9th decile’; Data Mining: Rule 3 for ‘Athlete Qualified’: VO2 Max > 5th decile and Interval Training Regimen in {15, 7-10} results in 100% Qualified for 83 athletes. Outputs: Reports & Graphs; Scoring Model.]

IBM Analytics Landscape
[Figure: Competitive Edge plotted against Complexity, with stages labeled Querying, Reporting, OLAP; Simulation, Alerts; Predictive Analytics; Optimization.]
Based on: Competing on Analytics, Davenport and Harris, 2007

IBM SPSS Product Areas

SPSS Modeler Capabilities
• Easy to Learn / Visual Design Paradigm
• Visual approach - no writing code!
• Comprehensive range of data mining methods
• Powerful Automated modeling
• Automatically prepares data
• Automatically finds the best model
• Mines text, web & survey data
• Fully integrated with Statistics
• Open & Scalable architecture
• No proprietary database required
• Leverage your existing IT investment
• Scales to enterprise volumes with SQL pushback, in-database scoring

Mining Methods in IBM SPSS Modeler 14
Data Preparation
• Dimension Reduction:
– Feature Selection
– Principal Components Analysis
– Factor Analysis
Classification and Regression
• Naïve Bayes
• Bayesian Networks
• Trees:
– CHAID
– C5.0
– C&RT
– QUEST
• Neural Networks
– Multi-Layer Perceptron
– Radial Basis Functions
• Regression
– Binomial, Multinomial Logistic
– Multiple, Multivariate Linear
• Generalized Linear Model
• Discriminant Analysis
• SVM (Support Vector Machine)
Segmentation and Anomaly Detection
• Clustering:
– K-Means
– Kohonen Self-Organizing Maps
– 2-Step (based on BIRCH)
Forecasting & Survival Analysis
• Time Series (ARIMA)
• Cox Regression
Market Basket & Sequence Analysis
• Association Rules:
– Apriori
– GRI
– CARMA
Case-Based Reasoning
• KNN – K Nearest Neighbor

Getting Closer to 360-degree Customer View:
• Demographics Data
• Web Data
• Text Mining: Comments
• Customer Usage Data

Predict: SPSS Text Analytics
• Leverages unstructured data via call center notes, blogs, web pages, open-ended surveys, etc.
to improve predictive model accuracy
• Extracts concepts from text and can categorize them as sentiments
• Strong visualization capabilities enable quick understanding of business issues

Classification and Regression Require a Target Field
• Text Analytics adds columns such as the number of calls categorized as a Negative Billing Sentiment
[Figure: data table showing Inputs and a Target, with the added column labeled Neg Billg]

Mining Methods “Learn” from Data
[Diagram labels: Customer Notes, Text Mining (Category = T or F); Customer Database; Survey/demographic (Satisfaction = 1-4); Web page hits, Web Mining (Event = Y or N); Merged Data, split 2/3 and 1/3 into Data To Train and Data To Test Model; Learning method; Predictive Model; New Data; Scored Predictions.]

Steps in the Data Mining Process
[Diagram of five phases: Understand, Prepare, Model, Evaluate, Deploy. Labels on the diagram:
– Understand, Prepare: Connect to data sources; Parse Trx by Mo.; Aggregate call data; Merge (plan & ID); Actions, Attitudes, Attributes; Data exploration; Transactions, 3rd Party, Surveys; Subdivide by region, plans, etc.; Anomaly detection
– Model, Evaluate: Define Target & Train Method; Test Method; Transform log Trx; Binary, hi trend; Feature selection; Gains, accuracy, AUROC, Profit, Contingency matrix; Trees, Neural Networks, Regressions, SVM, Bayesian Network
– Deploy: Predict on new data; Export Results, Model; Sales strategy]

Automated Data Mining Scoring Process
• Score the Model on New Data in Your Database
• Build a Geographic Crime Predictive Model
• Deploy a Map of Hot Spots in the Field
© 2009 SPSS Inc.

Should I Teach Data Mining Skills in My Department?
“In addition, as the U.S. business environment becomes increasingly competitive and organizations strive to increase efficiency and reduce costs through the use of information technology, computer and mathematical science occupations will see strong employment growth.”
-- 2008-2018 Outlook in Monthly Labor Review, Nov.
2009, p.83

Hot Careers for College Graduates 2010
A Special Report for Recent and Mid-Career College Graduates
UC San Diego Extension, May 2010
1. Health Information Technology
2. Clinical Trials Design and Management for Oncology
3. Data Mining
4. Embedded Engineering
5. Feature Writing for the Web
6. Geriatric Health Care
7. Mobile Media
8. Occupational Health and Safety
9. Spanish/English Translation and Interpretation
10. Sustainable Business Practices and the Greening of all Jobs
11. Teaching Adult Learners
12. Teaching English as a Foreign Language
13. Marine Biodiversity and Conservation
14. Health Law

A Sampling of Academic Disciplines Impacted by Data Mining – A Method of Obtaining Knowledge Empirically
• Arts: Music; Language, Linguistics; Writing / Communications
• Political Science / Government: Crime; Public Safety; Election Campaigning
• Law: Tax Fraud; Legal Documents
• Education: Admissions; Retention; Performance
• Physical Education: Athletic Performance
• Engineering: Management; Utilities; Petrochemical Yield & Reliability
• Science: Astronomy; Material Science
• Medicine: Genomic and Proteomic Analysis; Biomarkers; Diagnosis

How Do You Teach It?
Source: http://www.sigkdd.org/curriculum/CURMay06.pdf
I. Foundations
1. Intro
2. Data Preprocessing
3. Data Warehousing and OLAP for Data Mining
4. Association, correlation and frequent pattern analysis
5. Classification
6. Cluster and Outlier Analysis
7. Mining Time-Series and Sequence Data
8. Text Mining and Web Mining
9. Visual Data Mining
10. Data Mining: Industry efforts and social impacts
II. Advanced Topics
1. Advanced Data Preprocessing
2. Data Warehousing, OLAP, Data Generalization
3. Advanced association, correlation and frequent pattern analysis
4. Advanced Classification
5. Advanced cluster analysis
6. Advanced Time-Series and Sequential Data Mining
7. Mining Data Streams
8. Mining Spatial, Spatiotemporal and Multimedia data
9.
Mining Biological Data
10. Text Mining
11. Hypertext and Web mining
12. Data Mining Languages
13. Data Mining Applications
14. Data Mining and Society
15. Trends in Data Mining

Textbooks
[Figure: textbooks arranged by DIFFICULTY and by emphasis (Statistical, Business, Machine Learning, Practical S/W apps.). Authors shown: Hastie, Tibshirani & Friedman; Han, Kamber & Pei; Nisbet, Elder & Miner; Mitchell; Witten & Frank (shown twice); Larose (shown twice); Tan, Steinbach & Kumar; Margaret Dunham; Berry & Linoff.]

For a copy of the presentation please e-mail: sbarbee@us.ibm.com
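
As a supplement to the slide on how mining methods "learn" from data, the following is a minimal sketch of the flow it describes: merged data is split into a training part and a test part, a learning method produces a predictive model, and the model then scores new data. Several things are assumptions made only for illustration: the toy records, the reading that the 2/3 share trains and the 1/3 share tests, and the very simple threshold learner, which stands in for the trees, neural networks and other methods named in the presentation. None of the function names correspond to IBM SPSS Modeler.

```python
import random

# Hypothetical merged records: (negative_billing_calls, churned).
MERGED = [(n, True) for n in (9, 10, 11, 12, 9, 11)] + \
         [(n, False) for n in (1, 2, 3, 4, 2, 3)]


def split_train_test(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records, then cut it into train and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(records):
    """'Learn' a threshold halfway between the two class means."""
    pos = [x for x, y in records if y]
    neg = [x for x, y in records if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def score(threshold, values):
    """Apply the trained model to inputs whose outcome is unknown."""
    return [x > threshold for x in values]


def accuracy(threshold, records):
    """Share of held-out records the model predicts correctly."""
    predictions = score(threshold, [x for x, _ in records])
    return sum(p == y for p, (_, y) in zip(predictions, records)) / len(records)


train_set, test_set = split_train_test(MERGED)   # assumed 2/3 train, 1/3 test
model = train(train_set)                          # the "predictive model"
test_accuracy = accuracy(model, test_set)         # evaluated on unseen records
new_scores = score(model, [0, 13])                # "scored predictions" on new data

print(len(train_set), len(test_set), test_accuracy, new_scores)
```

Holding back the test records matters because accuracy measured on the training data alone tends to flatter the model; the held-out third gives a more honest estimate of how it will behave on new data.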