DSCI 4520/5240: Data Mining
Fall 2013 – Dr. Nick Evangelopoulos
Lecture 1: Introduction to Data Mining
Some slide material based on: Groth; Han and Kamber; Cerrito; SAS Education
slide 1

DSCI 4520/5240 DATA MINING

ITDS Résumé Book
ITDS majors (BCIS/DS): please send your résumé to melody.white@unt.edu so that we can include it in the ITDS Résumé Book we send to our corporate partners for hiring/co-op consideration. Make sure your résumé is formatted per UNT standards. Here is a link to the sample résumés: https://unt.optimalresume.com/
slide 2

Data (and the lack thereof)
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")
http://www.dilbert.com/2012-12-05/
slide 3

Data (and the lack thereof)
http://www.dilbert.com/2012-12-05/
slide 4

Nobel Laureate Calls Data Mining "A Must"
In an interview with ComputerWorld in January 1999, Dr. Penzias (who won the 1978 Nobel Prize in physics and was vice president and chief scientist at Bell Laboratories) named large-scale data mining from very large databases as the key application for corporations in the next few years.
In response to ComputerWorld's age-old question, "What will be the killer applications in the corporation?", Dr. Penzias replied: "Data mining." He then added: "Data mining will become much more important and companies will throw away nothing about their customers because it will be so valuable. If you're not doing this, you're out of business."
slide 5

What Is Data Mining?
• Data mining (knowledge discovery in databases): a process of identifying hidden patterns and relationships within data (Groth)
• Data mining: extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
slide 6

Motivation: “Necessity is the Mother of Invention”
• Data explosion problem
  – Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• Problem: We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
slide 7

Data Deluge
slide 8

Data Mining, circa 1963
IBM 7090, 600 cases
“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”
slide 9

Business Decision Support
• Database Marketing
  – Target marketing
  – Customer relationship management
• Credit Risk Management
  – Credit scoring
• Fraud Detection
• Healthcare Informatics
  – Clinical decision support
slide 10

Required Expertise
• Domain
• Data
• Analytical Methods
slide 11

Multidisciplinary
[Figure: Data Mining and related fields: Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases, KDD]
slide 12

What Is Data Mining?
• IT: Complicated database queries
• ML: Inductive learning from examples
• Stat: What we were taught not to do
slide 13

Comparing Statistics to Data Mining (from Cerrito 2006)
slide 14

Comparing Statistics to Data Mining (from Cerrito 2006)
slide 15

Predictive Modeling
[Figure: data table labeled Inputs, Cases, and Target]
slide 16

Types of Targets
• Supervised Classification
  – Event/no event (binary target)
  – Class label (multiclass problem)
• Regression
  – Continuous outcome
• Survival Analysis
  – Time-to-event (possibly censored)
slide 17

Why Data Mining? Potential Applications
• Database analysis and decision support
  – Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  – Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and management
• Other applications
  – Text mining (news groups, email, documents) and Web analysis
  – Intelligent query answering
slide 18

Market Analysis and Management (1)
• Where are the data sources for analysis?
  – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Cross-market analysis
  – Associations/correlations between product sales
  – Prediction based on the association information
slide 19

Market Analysis and Management (2)
• Customer profiling
  – Data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
  – Identify the best products for different customers
  – Use prediction to find what factors will attract new customers
slide 20

Corporate Analysis and Risk Management
• Finance planning and asset evaluation
  – Cash flow analysis and prediction
  – Contingent claim analysis to evaluate assets
  – Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
  – Summarize and compare the resources and spending
• Competition
  – Monitor competitors and market directions
  – Group customers into classes and set up a class-based pricing procedure
  – Set pricing strategy in a highly competitive market
slide 21

Other Applications
• Sports
  – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
• Astronomy
  – JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
  – IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
slide 22

In the News: Rexer Analytics Annual Data Mining Survey
The 2013 survey will become available in Fall 2013 (stay tuned)
slide 23

Rexer Analytics 2011 Survey Overview
• SURVEY & PARTICIPANTS: 52-item survey of data miners, conducted on-line in 2011. Participants: 1,319 data miners from over 60 countries.
• FIELDS & GOALS: CRM/Marketing has been the #1 field for the past five years. “Improving the understanding of customers”, “retaining customers” and other CRM goals continue to be the primary goals.
• ALGORITHMS: Decision trees, regression, and cluster analysis continue to form the top three algorithms for most data miners. A third of data miners currently use text mining and another third plan to do so in the future.
• TOOLS: R continued its rise this year and is now being used by close to half of all data miners (47%). R users prefer it for being free, open source, and having a wide variety of algorithms. STATISTICA is selected as the primary data mining tool by the most data miners (17%). STATISTICA, KNIME, Rapid Miner and Salford Systems received the strongest satisfaction ratings.
• ANALYTIC CAPABILITY AND SUCCESS MEASUREMENT: Only 12% of corporate respondents rate their company as having very high analytic sophistication. Measures of analytic success: Return on Investment (ROI), and predictive validity or accuracy of their models. Challenges to measuring success: user cooperation and data availability/quality.
slide 24

Where Data Miners Work
Data Mining is everywhere!
Data miners also report working in Non-profit (6%), Hospitality / Entertainment / Sports (3%), Military / Security (3%), and Other (9%).
© 2012 Rexer Analytics
slide 25

The Algorithms Data Miners use
© 2012 Rexer Analytics
slide 26

The positive impact of Data Mining
In the 5th Annual Survey (2011) of Rexer Analytics (1,319 participant data miners from over 60 countries), data miners shared examples of situations where data mining is having a positive impact on society. The five areas mentioned most often were:
• Health / Medical Progress
• Business Improvements
• Personalized Communications & Marketing
• Fraud Detection
• Environmental
slide 27

The rise of Text Mining
• Text Miners: 34%
• Plan to Start Text Mining: 33%
• No Plans to Conduct Text Mining: 33%
Text Material
• Customer / market surveys: 38%
• Blogs and other social media: 33%
• E-mail or other correspondence: 27%
• News articles: 25%
• Scientific or technical literature: 23%
• Web-site feedback: 22%
• Online forums or review sites: 21%
• Contact center notes or transcripts: 16%
• Employee surveys: 15%
• Insurance claims or underwriting notes: 15%
• Medical records: 11%
• Point of service notes or transcripts: 10%
© 2012 Rexer Analytics
slide 28

Data Mining Software
• The average data miner reports using 4 software tools.
• R is used by the most data miners (47%).
[Chart: tool usage, Overall and by segment: Corporate, Consultants, Academics, NGO / Gov’t]
© 2012 Rexer Analytics
slide 29

Satisfaction with Data Mining Tools
[Chart: ratings on a scale from Extremely Dissatisfied to Extremely Satisfied]
© 2012 Rexer Analytics
slide 30

Measuring Analytic Success
Question: Please share your best practices concerning how you measure analytic project performance / success. (text box provided for response)
Number of respondents:
• Model Performance (Accuracy, F, ROC, AUC, Lift): 53
• Financial Performance (ROI, etc.): 43
• Performance in Control or Other Group: 35
• Feedback from User / Client / Management: 29
• Cross-Validation: 14
© 2012 Rexer Analytics
slide 31

Overcoming Data Mining challenges
In the four annual data miner surveys, these key challenges have been identified by data miners more than any others:
• Dirty Data
• Explaining Data Mining to Others
• Unavailability of Data / Difficult Access to Data
slide 32

Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
slide 33

Steps of a KDD Process
• Learning the application domain: relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
• Choosing data mining algorithms: summarization, classification, regression, association, clustering
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
slide 34

Data Mining and Business Intelligence
Increasing potential to support business decisions (layers listed from top to bottom):
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
slide 35
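
Illustration: the KDD steps in miniature
The steps on the "Steps of a KDD Process" slide (selection, cleaning, data mining, pattern evaluation) can be sketched in a few lines of Python. This is only a toy sketch under stated assumptions, not the method used in the course or in any particular tool: the transactions, the field names (`id`, `items`), the function names, and the 0.5 support threshold are all hypothetical, and simple market basket association (pair counting with support and confidence) is chosen here only because it is one of the mining tasks named in the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw transactions; None, blanks, and mixed case stand in for "dirty data".
RAW = [
    {"id": 1, "items": ["Bread", " milk ", "eggs"]},
    {"id": 2, "items": ["bread", "milk"]},
    {"id": 3, "items": None},
    {"id": 4, "items": ["milk", "eggs", ""]},
    {"id": 5, "items": ["bread", "milk", "butter"]},
]


def select(records):
    """Selection: keep only the task-relevant field (the basket of items)."""
    return [r["items"] for r in records]


def clean(baskets):
    """Cleaning: drop missing baskets, normalize case/whitespace, drop empty items."""
    cleaned = []
    for basket in baskets:
        if not basket:
            continue
        items = {item.strip().lower() for item in basket if item and item.strip()}
        if items:
            cleaned.append(items)
    return cleaned


def mine_pairs(baskets):
    """Data mining: count how often each item and each item pair occurs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))
    return item_counts, pair_counts


def evaluate(item_counts, pair_counts, n, min_support=0.5):
    """Pattern evaluation: keep rules a -> b whose pair support meets the threshold."""
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support >= min_support:
            # confidence of a -> b is P(b | a) = count(a, b) / count(a)
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


baskets = clean(select(RAW))
item_counts, pair_counts = mine_pairs(baskets)
rules = evaluate(item_counts, pair_counts, len(baskets))
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Even in this toy case, the cleaning step decides what the mining step sees: one of the five raw records is discarded and several items are normalized before any pattern is counted, which is in line with the slide's remark that cleaning and preprocessing can take a large share of the effort.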