Predictive Analytics in the Enterprise
Carlos Bossy, Principal Consultant
CBIP, Data Mining, MCTS, MCITP BI
Quanta Intelligence

Introduction
Experience
• 10 years as a Business Intelligence Consultant: data warehousing, integration, data mining, predictive models
• 15 years in software development, from programmer through CTO
Speaker/Author
• Speaker at the PASS Summit in October
• Local and nationwide events and user groups
• Published a chapter on Big Data for Pro SQL Server 2012 Practices
Contact Info
• Email: cbossy@quantaintelligence.com
• Blog: www.carlosbossy.com
• Twitter: @carlosbossy

Today's Session
• Early Morning: Predictive Modeling Overview; Tools and Algorithms
• Late Morning: SQL Server Data Mining; Demo\Lab 1 – Model Development
• After Lunch: Use Models with DMX (C#, SSIS, SSRS); Demo\Lab 2 – Model Development
• Mid Afternoon: Demo – Text Mining with SVM; Exercise 1; Demo – Neural Network Code
• Late Afternoon: Case Studies; Predictive Model by Class; Exercise 2; Demo 3 – R, WEKA, RapidMiner

Introduction: Predictive Analytics
• Predictive Analytics
• Data Mining
• Predictive Modeling
• Math! Just a little…

Predictive Analytics Defined
Predictive Analytics is the discovery of predictive information, sometimes hidden, from (often large) databases, using related and unrelated data attributes, by applying algorithmic analysis and creating models that generate predictive results.
Predictive Modeling is the process of creating a model that best predicts the probability of an outcome.

Data Scientist
A Data Scientist incorporates techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high-performance computing, with the goal of extracting meaning from data.
Data science is a newer term often used interchangeably with competitive intelligence or business analytics.

Predictive Analytics on the Gartner Hype Cycle
[Gartner Hype Cycle chart]

What do we want to Achieve?
• Model: Enterprise, Embedded, Demo
• Data → Model → Prediction

Predictive Analytics
A vital component of the BI data architecture, available in real time to applications, reports, and ETL.
• Measures (the past)
• Predicts (the unknown)

ROI of Business Analytics Projects when Incorporating Predictive Analytics (Source: IDC): 145% vs. 89%

Yesterday <-----> Today
Legacy
• Manageable volumes of data
• The ease and power of SQL
• Domain experts could grasp and analyze a complete database
• Limited CPU horsepower
• Finite data
Today
• Explosion of data, doubling roughly every three years
• Data volumes can't be comprehended by humans
• Uncover complex, difficult-to-find patterns for a competitive edge
• Improve the professional judgment of the domain expert (small but valuable)
• Knowledge discovery: converting data to information
• N → Infinity

Applications
• Credit risk analysis
• Churn analysis
• Customer retention
• Targeted marketing
• Market basket analysis
• Sales forecasting
• Stock predictions
• Medical diagnosis
• Bioscience research
• Surveys
• Insurance rate quotes
• Credit card fraud
• Web site events
• Loan applications
• Hiring and recruiting
• Cross-marketing
• Attendance
• Anything!

Terms to Know
• Class Variable: the data we are predicting.
• Instance and Attribute: analogous to a row and a column used as input to a predictive model.
• Support: the number of instances of a particular pattern.
• Overfitting (Overtraining): a flaw in a predictive model where the model reflects the training data too closely.
• Confusion Matrix (Classification Matrix): a table showing predicted values and their accuracy.
• Cross Validation: a way to develop and test a model by training on one portion of the data and testing on another.

Modeling Methodology / Approach
Problem Statement
• What do you want to predict?
• Is data available?
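The confusion matrix defined under Terms to Know can be computed in a few lines. This is a minimal pure-Python sketch with hypothetical class values, not the SSAS implementation:

```python
# Sketch: building a confusion (classification) matrix from
# predicted vs. actual class values. The data here is hypothetical.
from collections import Counter

actual    = ["Y", "Y", "N", "Y", "N", "N", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "N", "Y", "Y", "N"]

# Count each (actual, predicted) pair.
matrix = Counter(zip(actual, predicted))

for a in ["Y", "N"]:
    row = [matrix[(a, p)] for p in ["Y", "N"]]
    print(f"actual={a}: predicted Y={row[0]}, N={row[1]}")

# Diagonal cells are the correct predictions.
accuracy = sum(matrix[(c, c)] for c in ["Y", "N"]) / len(actual)
print(f"accuracy = {accuracy:.2f}")  # 6 of 8 correct -> 0.75
```

The diagonal of the matrix holds correct predictions; off-diagonal cells are the false positives and false negatives that the accuracy chart in BI Studio summarizes.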
• Set a target: improve on current results.
• Develop the solution: integrate it into applications or reports.

Methodology / Approach
Gather Data → Explore Data → Prepare Data → Training/Test Sets → Choose Algorithms → Train → Test → Deploy → Monitor

Event Management Model
How many people will register for an event?
• Simultaneous events managed: > 100
• Mean error: 15%; target error: 5%
• Pass 1: 60 days; Pass 2: 30 days; Pass 3: 7 days

Medical Case Management
Workplace injuries: can a case benefit from Nurse Case Management intervention?
• 95% of claims < $2,500; 1% > $10,000
• Predict cost at intake

Gather Data
Predictive Analytics DB sources:
• Data Warehouse
• OLTP
• Cube
• External
• User Data

Prepare Data
• Transform the data to fit the problem, algorithm, and solution
• Get derivations from SMEs, data analysis, and groupings
• Discrete vs. continuous
• Bins and buckets (discretize)
• Analyze quantities: logarithmic, binary, linear
• Moving averages: simple, exponential, convergence, divergence
• Input to the algorithm in tabular form

Select Training and Testing Sets
• Training
• Testing

Choose Algorithm(s)
• Time Series
• Naïve Bayes
• Association
• Clustering
• Regression Tree
• Logistic Regression
• Sequence Clustering
• Random Forest
• Model Tree
• Genetic
• Support Vector Machine
• Neural Networks
• Many more…

Choose Algorithms
• Clustering
• Classification
• Regression
• Association

Decision Tree
A leaf-node regression formula, for example:
Yearly Income = 64,387.383 + 5,984.161*(Cars - 2.125) + 9,372.900*(Age - 60.313) + 0.763*(ID - 17,759.125) + 20,740.055*(Children - 0.125)

Model Training and Testing
• Apply algorithms to the training dataset
• Test the model using the test dataset
• Iterate

Predictive Model Applications

SQL Server Data Mining
• SQL Server 2005 and 2008: Analysis Services
• SQL Server 2012: Analysis Services Multidimensional
• Development: Business Intelligence Development Studio; Excel Data Mining Add-in

Event Registrations Model Development
• 7-year record of every individual registration transaction
• Best algorithm: Decision Tree
• Daily DW ETL
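The decision-tree leaf regression formula shown above under Choose Algorithms can be evaluated directly. A small sketch; the coefficients come from the slide, while the input values below are hypothetical:

```python
# Sketch: evaluating the decision-tree leaf regression formula
# from the slide. The inputs passed in below are hypothetical.
def predict_yearly_income(cars, age, id_, children):
    return (64_387.383
            + 5_984.161 * (cars - 2.125)
            + 9_372.900 * (age - 60.313)
            + 0.763 * (id_ - 17_759.125)
            + 20_740.055 * (children - 0.125))

income = predict_yearly_income(cars=3, age=62, id_=17_800, children=2)
print(f"predicted yearly income: {income:,.2f}")
```

Each term is a coefficient times the input's deviation from the mean of the training instances that reached this leaf, which is why a decision tree with regressors at the leaves is often called a regression tree.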
• Predicts each event and stores the result
• Input categories: Personal, Business, Account/Payment, Other, Events
• Transformed the data into a daily tabular snapshot for modeling
• Algorithms tried: Time Series, Decision Trees, Neural Networks, Logistic Regression
• Defined seasons and moving averages
• Achieved a 3.5% mean error rate (goal: 5%)

Event Registrations: Daily Snapshot in Tabular Form
[Sample tables: an Events table (Event ID, Season, Start/End Dates, Initial Reg Date, Early/Discount/Standard Fee Dates, Early/Discount/Standard/Late Fees, Attendance); a Registrations table (Person ID, Event ID, Company ID, Job Title, Reg Date, Paid in Full, Events Attended, Travel Booked, Group); and a daily snapshot table (Date, Event ID, Registrations, Registrations to Date, 5-Day and 10-Day Averages, Average Events Attended, Travel Booked, Groups, Paid in Full, Predicted Reg, Absolute Error, Percentage Error).]

Event Registrations: Predictive Model Outcome
30-day prediction; actual event attendance was 1,500:
• Goal: 1,425 – 1,575
• Before: 1,275 – 1,725
• After: 1,447 – 1,553

More Terms to Know (Advanced)
Ensemble modeling combines the results of two or more independent models to improve the quality of predictions.
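A minimal sketch of the ensemble idea: the three "models" below are hypothetical stand-ins that each return a probability for the positive class, and the ensemble combines them by taking the mean and applying a threshold:

```python
# Sketch: a simple averaging ensemble. The three "models" are
# hypothetical stand-ins for independently trained classifiers;
# each returns a probability for the positive class.
def model_a(x): return 0.30
def model_b(x): return 0.70
def model_c(x): return 0.55

def ensemble_predict(x, threshold=0.5):
    probs = [m(x) for m in (model_a, model_b, model_c)]
    mean_prob = sum(probs) / len(probs)      # combine by averaging
    return mean_prob, mean_prob >= threshold  # score and class

prob, positive = ensemble_predict(x=None)
print(prob, positive)
```

Averaging probabilities is one combination rule; majority voting and stacking (feeding the outputs into another model) are common alternatives.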
Boosting is a form of ensemble modeling that trains subsequent models by heavily weighting poorly classified instances.
Bagging (bootstrap aggregating) is also a form of ensemble modeling; it trains each model in the ensemble on a random subset of the training data, and the models then vote to obtain the best prediction.
Stacking trains a model using the output predictions of other models as its input.
Feature Selection is the process of selecting a subset of relevant data to reduce the amount of data a model must process to train. It can help a model generalize and avoid overfitting.

Worker Injury Claims
Data (unbalanced)
• 15-year history
• Closed cases
• 18-month lag
• 1 out of 20 is a high-cost case
Objectives
• Identify high-cost cases
• The rate of false positives is unimportant
• Must avoid false negatives (missed high-cost cases)
Target
• 85% success identifying high-cost cases at intake

Workers' Compensation Model Development
• Initial model successfully identified 79% of cases
• Subsequent development: balanced data for training; used for reporting

Ensemble Model Development
• Three neural network predictions for each claim: Intake, Allowance (RX and Non-RX), Duration
• Prescription model: 35% trues
• Mean probability of the models, with a 10% threshold
Claim Scoring
• Achieved a 90% success rate
• A claim is re-scored anytime it is updated
• The prediction probability score is always stored

Worker Injury Claims: Tabular Data View
[Sample tables: a Case table (Case, Person ID, Injury Date, Open Date, Medical Date, Rx Date, Duration, Body Part Injured, Injury Type); and a Person table (Person ID, Company ID, Job Title, Employment Date, Department, Birth Date, Gender, Height, Weight).]

More Terms and Things to Know
• Balancing the training dataset is done to ensure class values occur in similar proportions.
• Scoring is the output of a model (the predicted value).
• Model Disruption happens when an event occurs that causes a model to no longer make accurate predictions.
• Machine Learning is analogous to algorithmic predictive modeling.
• Expert Systems are developed using expertise and rules, not algorithms.
• Unstructured data, such as text and images, can be modeled to obtain information from the data.

Working Session: Group Model
A Predictive Model for Prospective Hires

Group Model: Developing our Predictive Model
1) Problem Statement
2) Gather Data
3) Explore Data
4) Prepare Data
5) Training / Test Sets
6) Choose Algorithms
7) Train
8) Test

Group Model: Problem Statement
• Should we hire a job candidate?
• What salary should we offer a potential employee?

Group Model: Where's the Data?

Group Model: Building the Model

Model Development Exercise #1
New Project
• SSAS Multidimensional and Data Mining Project
Set Properties
• Server/Instance
• Add a Data Source with the appropriate Impersonation Mode
• Add a Data Source View
Mining Structure
• Defines the data we will use
• The data must have a key
• Ensure the data type and content type are correct for each column
• Column usage must be declared (Key, Input, Predict, PredictOnly)
Mining Model
• One is created through the wizard
• Create more mining models for each mining structure
Process and Run
• Builds the models

Model Development Exercise #1
BI Studio (SQL Server Data Tools) Visualization
• Viewer (different for each algorithm), diagrams, Mining Legend
• Mining Model Viewer
• Mining Accuracy Chart
  o Lift Chart (compare models to random guesses, to the ideal model, and to each other)
  o Classification Matrix (Confusion Matrix)
  o Cross Validation (90-10)
• Mining Model Prediction

Model Development Exercise #1
Predict the class of an Iris plant from sepal length and width and petal length and width.
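Balancing, defined earlier under More Terms and Things to Know, can be sketched as oversampling the minority class. A minimal illustration with hypothetical claim records:

```python
# Sketch: balancing a training set by oversampling the minority
# class so both class values occur in similar proportions.
# The records below are hypothetical.
import random

random.seed(42)  # repeatable sampling for the sketch

majority = [{"cost": "low"}] * 95   # e.g., 95% low-cost claims
minority = [{"cost": "high"}] * 5   # e.g., 5% high-cost claims

# Resample the minority class (with replacement) up to majority size.
oversampled = [random.choice(minority) for _ in range(len(majority))]
balanced = majority + oversampled

counts = {"low": 0, "high": 0}
for rec in balanced:
    counts[rec["cost"]] += 1
print(counts)  # both classes now equally represented
```

Undersampling the majority class is the mirror-image approach; either way, only the training set is balanced, while evaluation and reporting use the data in its natural proportions.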
Dataset: iris table
Create: a mining structure with 30% holdout
Add mining models:
• Decision Tree
• Neural Network
• Logistic Regression
• Cluster
Process / Test / Compare

Model Development Exercise #1
Visualize and evaluate the quality of the models.
Dataset: iris table
Visuals
• Mining Model Viewer
• Mining Legend
Evaluate
• Classification Matrix
• Cross Validation
• Mining Model Prediction

Model Development Exercise #1
Fine-tune the mining models by changing settings and parameters.
Dataset: iris table
Decision Tree
• Change Minimum Support and view the Classification Matrix
• The Complexity Penalty prunes the tree and helps prevent overfitting
Cluster
• Parameters: Cluster Count, Minimum Support
• Change the cluster count to 3
• Add another cluster mining model with the cluster count set to 0
• Add one more with cluster count = 0 and modeling cardinality = 50

Tools
What analytics, big data, data mining, or data science software have you used in the past 12 months for a real project? (% of respondents, 2013 / 2012)
• Rapid-I RapidMiner/RapidAnalytics Free: 39.2% / 26.7%
• R: 37.4 / 30.7
• Excel: 28.0 / 29.8
• Weka / Pentaho: 14.3 / 14.8
• Python: 13.3 / 14.9
• Rapid-I RapidAnalytics/RapidMiner Commercial: 12.0 / 12.0
• SAS: 10.7 / 12.7
• MATLAB: 9.9 / 10.0
• StatSoft Statistica: 9.0 / 14.0
• IBM SPSS Statistics: 8.7 / 7.8
• Microsoft SQL Server: 7.0 / 5.0
• Tableau: 6.3 / 4.4
• IBM SPSS Modeler: 6.1 / 6.8
• KNIME free edition: 5.9 / 21.8
• SAS Enterprise Miner: 5.9 / 5.8

Tools – Excel
Data Exploration and Preparation
• Explore data (data profiling)
• Discretization of continuous data
• Identify/exclude outliers
• Re-label (rename/map columns)
• Random sampling to reduce the size of the training set
• Oversampling for balancing

Algorithm Differences
• Unsupervised: no class variable is specified (e.g., the cluster algorithm)
• Supervised: a class variable (the prediction) is specified
• Classification: predict a discrete variable
• Regression: predict a continuous variable
• Time Series: date-driven predictions
• Association: market-basket predictions

Lunch: 1 Hour

Neural Network Algorithm
New Weight = Weight + Error * Input * Learning Constant
Multi-layered… feed-forward… back-propagation

Neural Network Demo Using C#
Vegetable classification: if we know a vegetable is orange, green, brown, yellow, small, or large, is it a pumpkin, carrot, zucchini, yam, or potato?

Data Preparation: NFL Data (Game Averages)

Team               G  Pts   Yds    Pass Yds  Rush Yds  1stD  3rdM  3rdD%  4thM  4thD%  Pen  Pen Yds  TOP
Denver Broncos     3  42.3  486.7  374.7     112.0     26.0  21    53.8   0     N/A    26   233      29:53
Green Bay Packers  3  32.0  454.7  326.7     128.0     26.0  12    36.4   1     33.3   13   139      28:35
Chicago Bears      3  31.7  330.7  225.7     105.0     18.7  18    43.9   2     100.0  11   119      31:13
Seattle Seahawks   3  28.7  379.7  247.0     132.7     21.7  15    38.5   1     50.0   23   207      33:11
Dallas Cowboys     3  27.7  348.3  242.7     105.7     21.0  13    35.1   0     N/A    16   139      34:08
Detroit Lions      3  27.3  410.7  336.0     74.7      22.7  12    32.4   1     33.3   27   260      31:16

Using a Model
• SSRS
• SSIS
• .NET Framework

Common Mistakes
Lack of relevant data
• No history
• The data doesn't predict the desired class value
• Poor-quality data

Common Mistakes
Avoid overfitting
• Use a high-quality training set
• Cross-validation
• Pruning (increasing the Complexity Penalty helps avoid overfitting)

Model Development Exercise #2
Training data set: CollegePlans table
Test data set: CollegePlansTest table
Predict whether a student will attend college based on demographic and family characteristics. When complete, turn it into a new problem by predicting a continuous value.
Purpose: learn how to work with a test data set.
• Create a mining structure with 0% holdout
• Add a mining model using a Neural Network
• Process / Test

Model Development Exercise #2
Test by making mining model predictions against the test dataset using DMX; use the Predict and PredictProbability functions.
Change the prediction: add an additional Neural Network mining model that predicts IQ, and test the continuous prediction of IQ.
Add mining models using Decision Trees and compare them to each other and to the Neural Network.

Competitive Tools
RapidMiner is gaining popularity as a full-featured GUI over well-known algorithms.
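The neural network weight-update rule shown earlier (New Weight = Weight + Error * Input * Learning Constant) can be sketched for a single neuron. The deck's demo is in C#; this is a stand-in Python sketch that trains one neuron to learn the logical AND of two inputs:

```python
# Sketch: applying the rule new_w = w + error * input * rate
# to a single neuron (perceptron) learning logical AND.
def step(x):
    return 1 if x >= 0 else 0  # threshold activation

# Each instance: (inputs, target). The leading 1.0 feeds the bias weight.
data = [((1.0, 0, 0), 0), ((1.0, 0, 1), 0),
        ((1.0, 1, 0), 0), ((1.0, 1, 1), 1)]

weights = [0.0, 0.0, 0.0]
rate = 0.1  # the learning constant

for _ in range(20):  # training epochs
    for inputs, target in data:
        output = step(sum(w * i for w, i in zip(weights, inputs)))
        error = target - output
        # New Weight = Weight + Error * Input * Learning Constant
        weights = [w + error * i * rate for w, i in zip(weights, inputs)]

predictions = [step(sum(w * i for w, i in zip(weights, inputs)))
               for inputs, _ in data]
print(predictions)  # converges to [0, 0, 0, 1]
```

A multi-layered, feed-forward network applies the same idea, but back-propagation distributes the error across the hidden-layer weights instead of updating a single neuron.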
WEKA is frequently used in academia.
R is a programming language with a large library of statistical and predictive modeling algorithms.

Text Mining
What about text mining?
• Word lists
• Document categorization
• Novelty
• Sentiment

Quiz
• What is the purpose of the training set?
• What is the difference between supervised and unsupervised learning?
• What are some techniques to help avoid overfitting?

Model Development Exercise #3
Training data set: churn table
Test data set: churn_test table
Predict whether a phone customer will cancel their service.
Purpose: learn how to discretize and balance input data.
• Create a mining structure with 0% holdout
• Add mining models: Decision Tree, Cluster, Logistic Regression
• Process

Model Development Exercise #3
Evaluate using the classification matrix with the training dataset; note the results and the percentage each model predicted correctly.
Evaluate using the Lift Chart.
Evaluate with the Profit Chart by entering the following parameters:
• Population: 10,000
• Fixed Cost: 100,000
• Individual Cost: 50
• Revenue per Individual: 500
Repeat the evaluations using the test data set with the settings above.
Balance the training data set, re-create the mining structure, then process and test the models again.

Deploy - Monitor - Use
The predictive models live in an Analysis Services database, so deployment is similar to cube deployment.
Deploy to the server:
1) Project – Properties – Deployment – Server
2) Build – Deploy
Alternatives: backup/restore the Analysis Services database, or execute an XMLA script.

Deploy - Monitor - Use
Predictive models degrade over time and may be completely disrupted by expected or unexpected events.
• Track: when a prediction is made, save the input and output to a database table.
• Actuals: when the actual result is known, store it with the prediction.
• Mean Error: report the mean error periodically.
• Replace/refactor the model when the error worsens by a statistically significant amount or a negative trend appears.
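The monitoring loop above (track predictions, store actuals, report mean error) reduces to a short calculation. A sketch over hypothetical (predicted, actual) pairs, using mean absolute percentage error as the error measure:

```python
# Sketch: monitoring a deployed model by comparing stored
# predictions with actuals and reporting mean percentage error.
# The (predicted, actual) pairs below are hypothetical.
tracked = [(110, 104), (95, 100), (130, 125), (118, 120)]

# Absolute error of each prediction, relative to the actual value.
errors = [abs(pred - actual) / actual for pred, actual in tracked]
mean_error = sum(errors) / len(errors)

print(f"mean error: {mean_error:.1%}")
# Replace or refactor the model if this figure trends worse.
```

Running this periodically (for example, in the daily ETL) and charting the result over time makes a negative trend, or a sudden model disruption, visible before users notice bad predictions.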
Deploy - Monitor - Use
Exercise: use one of the models you've developed in an SSRS report, an SSIS package, or in .NET code.
• SSRS: connect to your Analysis Services database and write a query using DMX.
• SSIS: use a Data Mining Query task in the data flow to make predictions as data flows through an ETL job.
• .NET: use ADOMD.NET (Microsoft.AnalysisServices.AdomdClient) to query the predictive model with a dynamic DMX query.

Model Development Exercise #4
Training data sets: movies, actors, tech tables
Test data set: none
Predict which movie to recommend based on a customer's demographic info, movie buying history, and favorite actors.
Purpose: learn how to use the association algorithm and nested data.
• Create a mining structure with 0% holdout
• Add mining models: Association
• Process / Evaluate

Model Development: More Labs
• Predict auto characteristics: auto db table
• Predict political affiliation: VotingRecords db table
• Make your own model: BYOD

Summary
• Data Science is a changing field that is evolving very fast
• New skill sets are required on top of old ones

Thank You / Questions
Carlos Bossy, Quanta Intelligence
• Email: cbossy@quantaintelligence.com
• Blog: www.carlosbossy.com
• Twitter: @carlosbossy