INTRODUCTION TO DATA AND TEXT MINING
ANDREW PEASE, 8 MARCH 2013

Copyright © 2012, SAS Institute Inc. All rights reserved.

[Word cloud of analytics techniques:] Design of Experiments, Survey Data Analysis, Analysis of Variance, Spectral Analysis, Social Network Analysis, R Integration, Statistical Analysis, Nonlinear, Sentiment Analysis, Data Visualization, Vector Autoregressive Models, Discrete Event Simulation, Network Flow Models, Predictive Modeling, Econometrics, Sample Size Computations, Data Mining, Statistics, Scoring, Bayesian, Acceleration, Ensemble Models, Text Analytics, Decision Trees, Descriptive Modeling, Gradient Boosting Machines, Linear Programming, Interactive Matrix Programming, Matrix Programming, Multinomial Discrete Choice, Scheduling, Reliability Analysis, Cluster Analysis, Mixed-Integer Programming, Neural Networks, Forecasting, Process Capability Analysis, Programming, Exploratory Data Analysis, Nonparametric Analysis, Categorical Data Analysis, Statistical Process Control, Predictive Analytics, X11 & X12 Models, Survival Analysis, D-Optimal, High Performance Forecasting, Information Theory, Psychometric Analysis, Mixed Models, Multivariate Analysis, Quality Improvement, Interior-Point Models, Study Planning, Analysis of Means, Genetic Algorithms, ARIMA Models, Random Forests, Operations Research, Content Categorization, Ontology Management, Association & Sequence Analysis, Constraint Programming, Automated Scoring, Time Series Analysis, Simulation, Model Management, Regression, Large-Scale Forecasting, Fractional Factorial

DATA MINING IS:
• Discovering patterns, trends and relationships represented in data
• Developing models to understand and describe characteristics and activity based on these patterns
• Using insights to help evaluate future options and take fact-based decisions
• Deploying scores and results for timely, appropriate action

[Figure: timeline running from Past (Observed Events) to Future (Predicted Events)]

INDUSTRY SPECIFIC DATA MINING APPLICATIONS

Application: Credit Scoring (Banking)
What is predicted? Measure creditworthiness of new and existing customers
Driven business decision: How to assess and control risk within existing (or new) consumer portfolios?

Application: Market Basket Analysis (Retail)
What is predicted? Which products are likely to be purchased together?
Driven business decision: How to increase sales with cross-sell/up-sell, loyalty programs, promotions?

Application: Asset Maintenance (Utilities, Mfg., Oil & Gas)
What is predicted? Identify real drivers of asset or equipment failure
Driven business decision: How to minimize operational disruptions and maintenance costs?

Application: Health & Condition Mgmt. (Health Insurance)
What is predicted? Identify patients at risk of a chronic illness & offer treatment program
Driven business decision: How can we reduce healthcare costs and satisfy patients?

Application: Fraud Mgmt. (Govt., Insurance, Banks)
What is predicted? Detect unknown fraud cases and future risks
Driven business decision: How to decrease fraud losses and lower false positives?

Application: Drug Discovery (Life Science)
What is predicted? Find compounds that have desirable effects & detect drug behavior during trials
Driven business decision: How to bring drugs quickly and effectively to the marketplace?

DATA MINING METHODOLOGY
SEMMA

SAS TEXT ANALYTICS: UNCOVERING THE TECHNOLOGY
Content Categorization
Sentiment Analysis
Text Mining
Ontology Management

HEALTHCARE
LILLEBAELT HOSPITAL (Denmark)

BUSINESS ISSUE
• Reduce error in patient records
• Reduce manual effort of patient record audits

RESULTS
• “If data is wrong, the basis for decision making is also faulty. Therefore, the Clinically Correct Time-True Registration system makes sense even beyond our department and hospital.” - Sten Larsen, Chief Surgeon
• Creation of a database to improve clinical work in research and diagnosis

PUBLIC
1823 HONG KONG EFFICIENCY UNIT

BUSINESS ISSUE
• 1823 operates round-the-clock, including on Sundays and public holidays.
• Answers 2.65 million calls and 98,000 e-mails, including inquiries, suggestions and complaints
• Developed a Complaints Intelligence System that uncovers the trends, patterns and relationships inherent in the complaints

RESULTS
• “By decoding the ‘messages’ through statistical and root-cause analyses of complaints data, the government can better understand the voice of the people, and help government departments improve service delivery, make informed decisions and develop smart strategies. This in turn helps boost public satisfaction with the government, and build a quality city.” - Efficiency Unit’s Assistant Director, W. F. Yuk

DATA/TEXT MINING RESEARCH CONSIDERATIONS
• Data mining for patent research/control
• Copyright research/control
• Metadata-driven approach avoids ‘permanent’ data duplication
• Analyst needs ‘creative freedom’ in combining, transforming data
• User interfaces – programming vs point-and-click
• Cost to implement highly variable
• Future Indications
  • In-Memory
  • Big Data
  • Cloud Com

www.SAS.com
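
ILLUSTRATIVE SKETCH: MARKET BASKET ANALYSIS
The applications slide lists Market Basket Analysis as predicting which products are likely to be purchased together. One common way to quantify this is with the association measures support, confidence and lift. The following is only a minimal Python sketch under stated assumptions, not the method used in any SAS product: the baskets are invented toy data, and the names BASKETS, pair_rules and min_support are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented purely for illustration.
BASKETS = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"milk", "cereal"},
    {"bread", "milk"},
]


def pair_rules(baskets, min_support=0.3):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with lhs that also contain rhs
    lift       = confidence divided by the overall share of baskets with rhs
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # skip pairs that are bought together too rarely
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            rules.append((lhs, rhs, support, confidence, lift))

    # Highest lift first: these pairs co-occur more often than chance suggests.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


if __name__ == "__main__":
    for lhs, rhs, support, confidence, lift in pair_rules(BASKETS):
        print(f"{lhs} -> {rhs}: support={support:.2f}, "
              f"confidence={confidence:.2f}, lift={lift:.2f}")
```

On this toy data the rule bread -> butter has support 0.5 (3 of 6 baskets), confidence 0.75 (3 of the 4 bread baskets) and lift 1.5. A lift above 1 suggests the two products are bought together more often than their individual popularity would predict, which is the kind of signal a cross-sell or promotion decision could draw on.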