Why Data Mining?
◼ We are living in the Big Data Age!
◼ The explosive growth of data: terabytes of data are generated every second.
◼ The data is also very complex:
  ◼ Multiple types of data: tables, text, time series, images, voice, etc.
  ◼ Spatial and temporal aspects.
  ◼ Interconnected data of different types.
◼ We need to analyze the raw data to extract knowledge.
◼ We are drowning in data, but starving for knowledge!
  ◼ Data rich but information poor! What do those data mean?
  ◼ How do we analyze the data?
◼ “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets to harness the collective intelligence.

Why Data Mining?
◼ The Explosive Growth of Data
◼ Major sources of abundant data
  ◼ Business: Web, e-commerce, transactions, stocks, …
  ◼ Science: remote sensing, bioinformatics, scientific simulation, …
  ◼ Society and everyone: news, digital cameras, YouTube

What Is Data Mining?
◼ Data mining (knowledge discovery from data)
  ◼ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
  ◼ Mining gold from rocks or sand is called “gold mining,” not rock or sand mining. Analogously, data mining would have been more appropriately named “knowledge mining from data.”
◼ Alternative names
  ◼ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What Is Data Mining? Business Perspective
◼ Data mining is a business process by which organizations extract value from data to increase profit or efficiency, manage risk, improve service, formulate strategy, and increase knowledge.
◼ Data mining is a business process that uses business knowledge to create new knowledge.
◼ New knowledge is created by discovering and interpreting patterns in data.
Terminology Definition
◼ The terms artificial intelligence, machine learning, and data mining are often grouped together or used interchangeably because their definitions overlap, with no clear boundaries. The following general definitions reinforce this:
◼ Artificial Intelligence (AI)
  ◼ Demonstrates human-like intelligence and cognitive functions
  ◼ Deduction, pattern recognition, and interpretation of complex data
  ◼ Examples: Deep Blue playing chess, Watson
◼ Machine Learning (ML)
  ◼ Application of AI that allows the computer to learn from data automatically
  ◼ Uncovers hidden patterns and relationships
  ◼ Uses self-learning algorithms to evaluate results and improve performance over time
  ◼ Example: predicting rider demand to strategically dispatch drivers for Uber
◼ Data Mining
  ◼ Process of applying a set of analytical techniques (AI, ML) toward a specific goal
  ◼ Uncovers hidden patterns and relationships in data
  ◼ Data segmentation, pattern recognition, classification, prediction
  ◼ Example: grouping customers into segments for customized promotions

Data Mining vs. Data Analysis
◼ Data analysis (“statistics”) is about hypothesis testing: “confirmative”
  ◼ We start with a theory or idea about how something works.
◼ Data mining is about hypothesis generation: “explorative”
  ◼ Discovering connections and relationships not previously known.

Data Analysis (“Statistics”)                   Data Mining
---------------------------------------------  ---------------------------------------------------
Confirmative                                   Explorative
Small data sets; techniques are not            Can find patterns in very large amounts of data
  optimized for large amounts of data
Small number of variables                      Large number of variables
Numeric data                                   Numeric and non-numeric data
Clean data                                     Data cleaning
Requires strong statistical skills             Requires understanding of data and business problem

Data Mining vs. Business Intelligence
◼ Business Intelligence (BI) broadly encompasses reporting techniques and simple visualization, used when a known relationship or set of relationships is under investigation.
  ◼ Example: sales data broken down by product, region, time, and personnel.
  ◼ BI gives you an instant picture of sales performance and trends.
◼ Data mining (DM) is used when key relationships are not known, for example, which factors drive sales.
  ◼ There might be tens or hundreds of factors to be considered.
  ◼ DM helps you sift through these factors and find the single or combined factors that have important effects.

Most Common Data Mining Applications
◼ The area of business in which data mining most commonly delivers benefit is customer analytics (CA).
  ◼ Sometimes known as analytical CRM (Customer Relationship Management).
◼ Organizations use DM to help them understand their customers, predict their behavior, and forge stronger and more profitable customer relationships.
◼ Some of the common modeling techniques in CA:
  ◼ Segmentation: discovering through DM what different types of customers the organization has and which market segments can be targeted.
  ◼ Response modeling: discovering who is most likely to respond to a marketing campaign. This can be used at the strategy level or at the individual level to target customers with a specific marketing message.
  ◼ Attrition modeling: helps organizations understand why customers leave the customer base, so that retention campaigns can be targeted at the customers most at risk.

Most Common Data Mining Applications
◼ Risk and fraud is the second most common application area for DM.
◼ Risk management: an important aspect of every business, especially financial services.
  ◼ Credit risk (some people default on a debt).
  ◼ Market risk due to market movement.
  ◼ Data mining is particularly useful when insight into sources of risk is required and risk assessment needs to be integrated into operational systems.
◼ Fraud detection: addresses the risk of loss through fraud.
  ◼ Data mining helps organizations detect and prevent fraud that is buried in a large volume of transactions.
  ◼ Detecting fraudulent or suspect transactions depends on two types of recognition:
    1. Known patterns: recognizing patterns of fraud that have been seen before, that is, similarity to types of fraud found in previous known cases.
    2. Unknown patterns: discovering anomalies that represent types of fraud which have not been seen or detected before.

Data Mining Tasks
◼ Descriptive mining tasks, also called unsupervised learning techniques.
  ◼ These tasks characterize properties of the data in a target data set.
  ◼ The objective is to derive patterns (associations, trends, clusters, and anomalies) that summarize the underlying relationships in the data.
◼ Predictive mining tasks, also called supervised learning techniques.
  ◼ These tasks perform induction on the current data in order to make predictions.
  ◼ The objective is to predict the value of a particular attribute based on the values of other attributes.

Data Mining Tasks
◼ There are various data mining functionalities and tasks that can discover different kinds of patterns:
◼ Unsupervised learning:
  ◼ Association analysis: the task of discovering patterns that describe relationships.
  ◼ Anomaly detection: the task of detecting unusual deviations.
  ◼ Clustering: the task of discovering groups and structures.
◼ Supervised learning:
  ◼ Classification: the task of assigning records to one of several predefined categories (a discrete target variable).
  ◼ Regression: the task of finding a function that models a continuous target variable.

August 30, 2022, Data Mining: Concepts and Techniques

Data Mining Process
◼ Data mining is a complex process of examining large sets of data to identify patterns and then using those patterns to gain valuable business insights.
◼ Established standards with defined steps:
  ◼ Cross-Industry Standard Process for Data Mining (CRISP-DM)
  ◼ Sample, Explore, Modify, Model, and Assess (SEMMA), from SAS
  ◼ Knowledge Discovery from Data (KDD)

Data Mining Process: CRISP-DM
◼ Business understanding: situational context, specific objectives, project schedule, deliverables
◼ Data understanding: collecting raw data, preliminary results, potential hypotheses
◼ Data preparation: record and variable selection, wrangling, cleaning
◼ Modeling: selection and execution of data mining techniques, converting or transforming data to the formats/types needed for certain analyses, documenting assumptions, cross-validation
◼ Evaluation: evaluating the performance of competing models, selecting the best models, reviewing and interpreting results, developing recommendations
◼ Deployment: developing a set of actionable insights and a strategy for deployment/monitoring/feedback

Data Mining Overview: SEMMA
◼ Sample: appropriate variables are identified; data is merged and/or divided and sampled
◼ Explore: exploratory data analysis
◼ Modify: variables are selected, created, and/or transformed
◼ Model: analysis techniques and models are chosen and applied
◼ Assess: results from different models are presented to end users, outcomes and performance are compared, new observations are scored
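
The five SEMMA steps above can be sketched in a few lines of Python. This is only an illustrative toy, not SAS's implementation: the customer records, the `visits` and `churned` fields, the helper names (`make_raw_data`, `accuracy`, `run_semma`), and the choice of a one-threshold rule as the model are all assumptions made up for the example.

```python
from statistics import median

def make_raw_data():
    """Hypothetical customer records (made-up): monthly visits and a churn flag."""
    rows = [{"visits": v, "churned": v < 5} for v in range(20) for _ in range(4)]
    rows[10]["visits"] = None   # simulate missing values for the Modify step
    rows[50]["visits"] = None
    return rows

def accuracy(predict, rows):
    """Share of rows whose churn flag is predicted correctly."""
    return sum(predict(r["visits"]) == r["churned"] for r in rows) / len(rows)

def run_semma(rows):
    # Sample: partition the data into training and hold-out sets
    # (a systematic split, so the sketch stays deterministic).
    train, test = rows[0::2], rows[1::2]

    # Explore: simple summaries of the training data.
    n_missing = sum(r["visits"] is None for r in train)
    churn_rate = sum(r["churned"] for r in train) / len(train)

    # Modify: impute missing visit counts with the training median.
    fill = median(r["visits"] for r in train if r["visits"] is not None)
    train = [{**r, "visits": fill if r["visits"] is None else r["visits"]}
             for r in train]

    # Model: two competing models.
    # (a) A baseline that always predicts the majority class.
    majority = churn_rate > 0.5
    baseline = lambda visits: majority
    # (b) A one-threshold rule "churn if visits < t", with t chosen on the training set.
    candidates = sorted({r["visits"] for r in train})
    threshold = max(candidates, key=lambda t: accuracy(lambda v: v < t, train))
    stump = lambda visits: visits < threshold

    # Assess: compare the models on the hold-out set and score new observations.
    return {
        "n_missing": n_missing,
        "churn_rate": churn_rate,
        "threshold": threshold,
        "baseline_accuracy": accuracy(baseline, test),
        "stump_accuracy": accuracy(stump, test),
        "new_scores": [stump(v) for v in (2, 15)],
    }

result = run_semma(make_raw_data())
print(result)
```

On this made-up data the threshold rule beats the majority-class baseline on the hold-out set, which is the kind of comparison the Assess step calls for. A real project would replace each step with proper sampling, exploratory analysis, and a modeling library.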