A Best Practices Framework for Data Mining
Mark Tabladillo, Ph.D., Data Mining Scientist
Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor
13 June 2013 | Virtual Business Analytics Chapter

About MarkTab
• Training and Consulting with http://marktab.com
• Data Mining Resources and Blog at http://marktab.net
• Ph.D., Industrial Engineering, Georgia Tech
• Training and consulting internationally across many industries (SAS and Microsoft)
• Contributed to peer-reviewed research and legislation
• Mentoring doctoral dissertations at the accredited University of Phoenix
• Presenter

About Artus
• Assistant Professor for Analytic Information Systems and Business Intelligence
• PhD in computer science
• Research: data mining for e-commerce and mobile business
• Consultant

Section One: DATA MINING FOUNDATION

Definition 1 (Informal)
Data mining is the automated or semi-automated process of discovering patterns in data.

Definition 2
Data mining is a process using:
1. Exploratory Data Analysis: statistical and visual data analysis techniques. Forming a hypothesis.
2. Data Modeling & Predictions: describe data using probability distributions and machine learning algorithms (“model”). Fitting a hypothesis.
3. Statistical Learning Theory: model selection, model evaluation.

Data Mining Visualized
Target = f(Input)
• Target: attribute we are interested in.
• Input: data available for our predictions.
• Function f: describes the relationship between target and input.
• Regrettably, f is unknown and unknowable.

Data Mining Visualized
Real world: unknown f, with Target = f(Input)
Data mining model: hypothesis h, with Target ≈ h(Input)
• Need to find a “good” h. h is your DM “algorithm”.
• Input data has to be appropriate.
• Select and transform as needed
• Correct modeling of target is crucial

Top 10 Expectations
BEST PRACTICE: LEARN FROM EXPERIENCE

Expectation Ten
• Marketing: People can start data mining in 10 minutes…
• More Scientific: Better models come from days, weeks or months of iterative improvement.

Expectation Nine
• Marketing: Data miners can provide provably good models with little or zero knowledge of the specific industry…
• More Scientific: Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.

Expectation Eight
• Marketing: Open source software can provide quality results worthy of peer-reviewed literature…
• More Scientific: Commercial software with years-long service options is required for enterprise scale.

Expectation Seven
• Marketing: We can learn a lot from the current data warehouses, cubes, and big data…
• More Scientific: We can improve our modeling by creating new data collection strategies.

Expectation Six
• Marketing: People can build data mining models with little or zero data cleaning…
• More Scientific: Better results happen when we organize and rearrange data for best success.

Expectation Five
• Marketing: Data mining can provide answers to problems…
• More Scientific: Most times we only get detailed insights toward larger problems, and sometimes uncover more problems than we started with.

Expectation Four
• Marketing: A little data mining knowledge can provide an organization with a competitive edge…
• More Scientific: The edge grows along with experience and better study of the methodology and mathematics.

Expectation Three
• Marketing: Individual professionals can deliver excellent predictive analysis…
• More Scientific: Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.

Expectation Two
• Marketing: Numbers speak for themselves and can influence better decision making…
• More Scientific: Leadership strategy helps teams deliver results in the best way given the current culture.

Expectation One
• Marketing: A lot of data mining best practices and strategies can be communicated in an hour or a day…
• More Scientific: The best commitment is ongoing education on both data mining and machine learning technology.

Section Two: ANALYZING AND PREPARING DATA

Best practice: study individual attributes
• Histograms and frequencies (discrete)
• Kernel density estimates
• Cumulative distribution function
• Rank-order plots and lift charts
• Summary statistics (continuous)
• Box-and-whisker plots

Best practice: study combinations
• Pivot tables
• Scatter plots
• Logarithmic plots
• Naïve Bayes
• Correlation matrices
• False-color plots
• Scatter-plot matrix
• Co-plot

Section Three: MACHINE LEARNING ALGORITHMS

How to Choose an Algorithm
• Choosing an algorithm or series of algorithms is an art
• One algorithm could perform different tasks
• Be willing to experiment with algorithms and algorithm parameters

Algorithms for Data Mining Tasks (1 of 2)
• Microsoft Time Series: Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.
• Microsoft Decision Trees: Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.
• Microsoft Linear Regression: If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.
• Microsoft Clustering: Identifies relationships in a dataset that you might not logically derive through casual observation.
Uses iterative techniques to group records into clusters that contain similar characteristics.

Algorithms for Data Mining Tasks (2 of 2)
• Microsoft Naïve Bayes: Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.
• Microsoft Logistic Regression: Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.
• Microsoft Neural Network: Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used for classification of discrete attributes and regression of continuous attributes.
• Microsoft Association Rules: Builds rules that describe which items are likely to appear together in a transaction.
• Microsoft Sequence Clustering: Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
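
The advice to experiment with algorithms, together with the earlier idea of searching for a "good" hypothesis h when the true f is unknown, can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the synthetic data, the two candidate learners, and every name used (make_data, fit_baseline, fit_line, mse) are assumptions chosen for the demo. It uses only the Python standard library, not the Microsoft algorithms listed above.

```python
import random
import statistics


def make_data(n, seed=0):
    """Synthetic rows (input, target). The 'unknown' f is assumed linear plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in xs]


def fit_baseline(train):
    """Baseline hypothesis: always predict the mean target seen in training."""
    mean_y = statistics.fmean(y for _, y in train)
    return lambda x: mean_y


def fit_line(train):
    """Candidate hypothesis: ordinary least-squares line through the training rows."""
    mx = statistics.fmean(x for x, _ in train)
    my = statistics.fmean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(h, rows):
    """Mean squared error of hypothesis h on the given rows."""
    return statistics.fmean((h(x) - y) ** 2 for x, y in rows)


# Hold out data the candidates never see during fitting.
data = make_data(200)
train, holdout = data[:150], data[150:]

# Try several hypotheses and compare each against the baseline.
candidates = {"baseline (mean)": fit_baseline, "linear regression": fit_line}
scores = {name: mse(fit(train), holdout) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: holdout MSE = {score:.2f}")
print("selected:", best)
```

Scoring every candidate on held-out rows and against a simple baseline is one way to make "how to measure success" concrete before committing to an algorithm. In a real project the candidates would be the actual algorithms and parameter settings under consideration.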

Best practice: Document your science
• Describe the business problem
• Determine how to measure success (including baseline)
• Document what was learned during data preparation and analysis
• Justify the algorithms used during the investigation
• List the assumptions that were made

Section Four: ACHIEVING BUSINESS VALUE

Leadership challenges
• Build on organizational communications
• Consider redoing analysis
• Find results champions
• Celebrate the results

Best practice: prepare the next cycle
• Note strengths, weaknesses, opportunities, risks
• Build consensus on model expiration dates
• Encourage and improve the process
• Create insight into new future data collection

Conclusion
Best Practices Framework
• Provide a data mining foundation
• Prepare the data
• Evaluate machine learning output
• Plan to move toward actionable decisions

Resources
• Free Win x64 Python libs: http://www.lfd.uci.edu/~gohlke/pythonlibs/
• Commercial Python: http://www.enthought.com/products/epd.php
• R Tutorial: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
• SQL Server Analysis Services Data Mining: http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx
• Data Mining Portal: http://marktab.net
• Data Mining Team Portal: http://sqlserverdatamining.com
• Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”